Pralekha covers 12 languages—Bengali (ben), Gujarati (guj), Hindi (hin), Kannada (kan), Malayalam (mal), Marathi (mar), Odia (ori), Punjabi (pan), Tamil (tam), Telugu (tel), Urdu (urd), and English (eng).
It includes a mixture of high- and medium-resource languages, covering 11 different scripts. The dataset spans two broad domains: News Bulletins from the Indian Press Information Bureau (PIB) and Podcast Scripts from Mann Ki Baat (MKB), offering both written and spoken forms of data. All data is human-written or human-verified, ensuring high quality.
While this accounts for alignable (parallel) documents, real-world multilingual corpora often also contain unalignable documents. To simulate this for CLDA evaluation, we sample unalignable documents from Sangraha Unverified, selecting 50% of Pralekha's size to maintain a 1:2 ratio of unalignable to alignable documents, as sketched below.
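As a rough illustration, the mixing step could look like the following Python sketch. The function name and the `alignable_docs` / `unalignable_pool` inputs are placeholders, not part of any released tooling.

```python
import random

def build_clda_pool(alignable_docs, unalignable_pool, seed=42):
    """Mix alignable documents with unalignable ones at a 1:2
    unalignable-to-alignable ratio, i.e. sample 50% of the
    alignable set's size from the unalignable pool."""
    rng = random.Random(seed)
    n_unalignable = len(alignable_docs) // 2  # 50% of Pralekha's size
    sampled = rng.sample(unalignable_pool, n_unalignable)
    return alignable_docs + sampled
```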
For Machine Translation (MT) tasks, we first randomly sample 1,000 documents from the alignable subset per English-Indic language pair for each of the development (dev) and test sets, ensuring a good distribution of document lengths. After excluding these sampled documents, we use the remainder as the training set for document-level machine translation models.
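A minimal sketch of that split, assuming the alignable documents are grouped in a dict keyed by language pair (`docs_by_pair` is a placeholder name). Note that a plain random sample like this does not by itself guarantee the length distribution described above.

```python
import random

def make_mt_splits(docs_by_pair, n_dev=1000, n_test=1000, seed=42):
    """For each English-Indic pair, hold out n_dev + n_test alignable
    documents for the dev and test sets and keep the rest for training."""
    rng = random.Random(seed)
    splits = {}
    for pair, docs in docs_by_pair.items():
        shuffled = docs[:]
        rng.shuffle(shuffled)
        dev = shuffled[:n_dev]
        test = shuffled[n_dev:n_dev + n_test]
        train = shuffled[n_dev + n_test:]  # remaining documents form the training set
        splits[pair] = {"dev": dev, "test": test, "train": train}
    return splits
```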
Attribution 4.0 International (CC BY 4.0)