The-LTRC-Hindi-Telugu-Parallel-Corpus

About Dataset

The LTRC Hindi-Telugu Parallel Corpus, developed under ILMT - PILOT and funded by MEITY.

Dataset Structure:
Each file name indicates the source language, the target language, and the split (train or test).
Each record contains the following fields:
{'domain', 'source_language', 'target_language', 'source_text', 'target_text'}
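
As a rough illustration of the record layout above, the sketch below checks that a record carries the five listed fields and tallies records per domain. The helper names `is_valid` and `count_by_domain` are hypothetical, the sample records are made-up placeholders rather than corpus data, and using the language tags `hin_Deva` / `tel_Telu` as field values is an assumption; the card does not state the on-disk file format, so the example works on in-memory dictionaries only.

```python
from collections import Counter
from typing import Dict, Iterable

# Field names as listed in the dataset card.
FIELDS = ("domain", "source_language", "target_language", "source_text", "target_text")


def is_valid(record: Dict[str, str]) -> bool:
    """Return True if the record has exactly the expected fields, each a non-empty string."""
    return set(record) == set(FIELDS) and all(
        isinstance(record[f], str) and bool(record[f].strip()) for f in FIELDS
    )


def count_by_domain(records: Iterable[Dict[str, str]]) -> Counter:
    """Count well-formed records per domain, skipping malformed ones."""
    return Counter(r["domain"] for r in records if is_valid(r))


# Placeholder records for illustration only (NOT taken from the corpus).
# The language tag values are an assumption about how the fields are filled.
sample = [
    {"domain": "Law", "source_language": "hin_Deva", "target_language": "tel_Telu",
     "source_text": "<Hindi sentence 1>", "target_text": "<Telugu sentence 1>"},
    {"domain": "Chemistry", "source_language": "hin_Deva", "target_language": "tel_Telu",
     "source_text": "<Hindi sentence 2>", "target_text": "<Telugu sentence 2>"},
    # Malformed on purpose: 'target_text' is missing.
    {"domain": "Law", "source_language": "hin_Deva", "target_language": "tel_Telu",
     "source_text": "<Hindi sentence 3>"},
]

if __name__ == "__main__":
    for domain, n in sorted(count_by_domain(sample).items()):
        print(domain, n)
```

A check of this kind may be useful before training, since a record with a missing or empty side would otherwise end up as an unpaired sentence.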

Dataset Size and Domains:
506,178 parallel sentences across the following domains: Chemistry, Law, News & General, HealthCare, Education Others, open education books

Data Source:
Educational Lectures

Details:
Curated by: LTRC, IIIT Hyderabad, India
Funded by: MEITY, GOI, India
Shared by: MT-NLP, LTRC, IIIT Hyderabad, India
Language(s) (NLP): tel_Telu, hin_Deva
Paper: The LTRC Hindi-Telugu Parallel Corpus; Vandan Mujadia, Dipti Sharma (LREC 2022)

Project Investigator:
Prof. Dipti Misra Sharma, LTRC, IIIT Hyderabad

Data Curators:
LTRC Language Experts

BibTeX:

@inproceedings{mujadia-sharma-2022-ltrc,
    title = "The {LTRC} {H}indi-{T}elugu Parallel Corpus",
    author = "Mujadia, Vandan  and
      Sharma, Dipti",
    editor = "Calzolari, Nicoletta  and
      B{\'e}chet, Fr{\'e}d{\'e}ric  and
      Blache, Philippe  and
      Choukri, Khalid  and
      Cieri, Christopher  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Isahara, Hitoshi  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, H{\'e}l{\`e}ne  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.365",
    pages = "3417--3424",
    abstract = "We present the Hindi-Telugu Parallel Corpus of different technical domains such as Natural Science, Computer Science, Law and Healthcare along with the General domain. The qualitative corpus consists of 700K parallel sentences of which 535K sentences were created using multiple methods such as extract, align and review of Hindi-Telugu corpora, end-to-end human translation, iterative back-translation driven post-editing and around 165K parallel sentences were collected from available sources in the public domain. We present the comparative assessment of created parallel corpora for representativeness and diversity. The corpus has been pre-processed for machine translation, and we trained a neural machine translation system using it and report state-of-the-art baseline results on the developed development set over multiple domains and on available benchmarks. With this, we define a new task on Domain Machine Translation for low resource language pairs such as Hindi and Telugu. The developed corpus (535K) is freely available for non-commercial research and to the best of our knowledge, this is the well curated, largest, publicly available domain parallel corpus for Hindi-Telugu.",
}