Itihasa

Itihāsa is a Sanskrit-English translation corpus containing 93,000 Sanskrit shlokas and their English translations, extracted from M. N. Dutt's seminal works on the Ramayana and the Mahabharata.

About Dataset

The dataset contains over 93,000 aligned Sanskrit and English text pairs collected from classical Indian epics (the Mahabharata and the Ramayana). Each record includes metadata such as book, volume, chapter, and shloka number, together with the Sanskrit verse and its corresponding English translation. This dataset provides a unique collection of Sanskrit translations of the Ramayana, along with analysis and commentary, and extracts from the Ramayana book. The dataset is primarily focused on the text of the Ramayana and its translations, with an emphasis on understanding the cultural and historical context in which it was written.

Citation: Aralikatte, Rahul; Miryam de Lhoneux; Anoop Kunchukuttan; and Anders Søgaard. “Itihasa: A Large-Scale Corpus for Sanskrit to English Translation.” In Proceedings of the 8th Workshop on Asian Translation (WAT2021), August 2021, Online. Association for Computational Linguistics, pp. 191–197. Available at: https://aclanthology.org/2021.wat-1.22/

This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative led by CivicDataLab (CDL), partnering with the Gates Foundation in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation, and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).
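As an illustration of the record layout described above (book, volume, chapter, shloka number, Sanskrit verse, English translation), the following is a minimal sketch of how one aligned pair might be represented and read in Python. The field names, the JSON Lines layout, and the `parse_records` helper are assumptions made for this example; the actual file format and column names of the distributed dataset may differ.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ShlokaPair:
    """One aligned Sanskrit-English pair with its metadata (hypothetical field names)."""
    book: str
    volume: str
    chapter: str
    shloka_number: str
    sanskrit: str
    english: str


def parse_records(lines: Iterable[str]) -> Iterator[ShlokaPair]:
    """Yield one ShlokaPair per non-empty JSON line.

    Assumes one JSON object per line with the keys used below.
    Pairs with an empty Sanskrit or English side are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        raw = json.loads(line)
        pair = ShlokaPair(
            book=str(raw["book"]),
            volume=str(raw["volume"]),
            chapter=str(raw["chapter"]),
            shloka_number=str(raw["shloka_number"]),
            sanskrit=str(raw["sanskrit"]).strip(),
            english=str(raw["english"]).strip(),
        )
        if pair.sanskrit and pair.english:
            yield pair
```

With a file in this assumed layout, `list(parse_records(open(path, encoding="utf-8")))` would give a list of pairs that can be passed on to a tokenizer or a translation model; `path` here is a placeholder, not a file shipped with the dataset.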

Purpose of Dataset

This dataset is intended for academic research, language learning, cultural analysis, and education. It can also be used to develop natural language processing models, machine translation systems, and text analysis tools, and to help preserve the cultural heritage of ancient India.