Pralekha

Pralekha is a large-scale parallel document dataset spanning 11 Indic languages and English.

About Dataset

Pralekha covers 12 languages: Bengali (ben), Gujarati (guj), Hindi (hin), Kannada (kan), Malayalam (mal), Marathi (mar), Odia (ori), Punjabi (pan), Tamil (tam), Telugu (tel), Urdu (urd), and English (eng). It includes a mixture of high- and medium-resource languages, covering 11 different scripts. The dataset spans two broad domains: News Bulletins from the Indian Press Information Bureau (PIB) and Podcast Scripts from Mann Ki Baat (MKB), offering both written and spoken forms of data. All the data is human-written or human-verified, ensuring high quality.

The above accounts for alignable (parallel) documents. In real-world scenarios, however, multilingual corpora often include unalignable documents. To simulate this for cross-lingual document alignment (CLDA) evaluation, we sample unalignable documents from Sangraha Unverified, selecting 50% of Pralekha’s size to maintain a 1:2 ratio of unalignable to alignable documents.
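The 1:2 ratio described above can be illustrated with a minimal sketch. The function name `sample_unalignable`, the seed, and the placeholder document IDs are hypothetical and are not taken from the dataset's tooling; the sketch only assumes that unalignable documents are drawn uniformly at random from a larger candidate pool.

```python
import random


def sample_unalignable(alignable_docs, candidate_pool, ratio=0.5, seed=0):
    """Draw a random sample from candidate_pool sized at ratio * len(alignable_docs).

    Hypothetical helper: ratio=0.5 gives one unalignable document
    for every two alignable ones (a 1:2 ratio).
    """
    target = int(len(alignable_docs) * ratio)
    if target > len(candidate_pool):
        raise ValueError("candidate pool is too small for the requested ratio")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidate_pool, target)  # sampling without replacement


# Placeholder IDs standing in for real documents (assumption for illustration).
alignable = [f"pralekha_doc_{i}" for i in range(1000)]
pool = [f"sangraha_doc_{i}" for i in range(5000)]

unalignable = sample_unalignable(alignable, pool)
print(len(unalignable), len(alignable))
```

With 1,000 alignable placeholders, the sample contains 500 unalignable ones, which matches the stated ratio.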

For Machine Translation (MT) tasks, we first randomly sample 1,000 documents from the alignable subset per English-Indic language pair for each of the development (dev) and test sets, ensuring a good distribution of document lengths. After excluding these sampled documents, we use the remaining documents as the training set for document-level machine translation models.
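As a rough illustration of this split for a single language pair, the sketch below holds out 1,000 dev and 1,000 test documents and keeps the rest for training. The function `split_pair`, the seed, and the `eng-hin-*` IDs are hypothetical. The sketch uses plain random sampling and does not reproduce the document-length balancing mentioned above, since the description does not specify how it is done.

```python
import random


def split_pair(doc_ids, n_dev=1000, n_test=1000, seed=0):
    """Split one language pair's alignable documents into train/dev/test.

    Hypothetical helper: dev and test are disjoint random samples,
    and every remaining document goes to the training set.
    """
    if len(doc_ids) < n_dev + n_test:
        raise ValueError("not enough documents for the requested dev/test sizes")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(doc_ids)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Placeholder IDs for one English-Indic pair (assumption for illustration).
doc_ids = [f"eng-hin-{i}" for i in range(5000)]
train, dev, test = split_pair(doc_ids)
print(len(train), len(dev), len(test))
```

Running the same function once per English-Indic pair would give each pair its own dev and test sets, in line with the per-pair sampling described above.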

Activity Overview

  • Downloads 0
  • Redirect 8
  • Views 28
  • File Size 0

Tags

  • Parallel Corpus
  • Machine Translation
  • multilingual NLP
  • Indic Languages
  • document-alignment

License Control

Attribution 4.0 International (CC BY 4.0)