ORGANISATION

CoRil-Parallel

Indic Parallel Corpus: 11 Indian Language Pairs for Machine Translation

About Dataset

This dataset contains a parallel corpus for machine translation across 11 Indian language pairs. The data is curated to cover three domains: Governance, Health, and General. It is intended to help researchers and developers build and evaluate machine translation models for Indian languages.

Description:
The corpus provides parallel sentences for the language pairs listed below, with a focus on Hindi as a pivot language. All translation pairs are bidirectional. The data has been sourced and cleaned for use in training Neural Machine Translation (NMT) models.

The dataset includes the following 11 language pairs:

Source Language	Target Language	Language Codes
Hindi	Gujarati	`hi` - `gu`
Hindi	Kashmiri	`hi` - `ks`
Hindi	Telugu	`hi` - `te`
Hindi	Kannada	`hi` - `kn`
Hindi	Punjabi	`hi` - `pa`
Hindi	Oriya	`hi` - `or`
Hindi	Urdu	`hi` - `ur`
Hindi	Sindhi	`hi` - `sd`
Hindi	Dogri	`hi` - `doi`
English	Hindi	`en` - `hi`
Telugu	English	`te` - `en`

Dataset Structure and Statistics:
The data is organized by language pair and domain. Each language pair directory contains sub-directories for the specific domains. The following table provides a detailed breakdown of the number of parallel sentences for each language pair, domain, and data split (train/dev/test). An empty cell indicates that data for that specific domain is not available.

Language Pair	General (Train)	General (Dev)	General (Test)	Governance (Train)	Governance (Dev)	Governance (Test)	Health (Train)	Health (Dev)	Health (Test)
`dg_hi`	12,411	500	500	6,947	500	500
`en_hi`				38,790	500	500	10,043	500	500
`en_te`				9,976	500	500	17,237	500	500
`gu_hi`				18,850	500	500
`hi_dg`				30,359	500	500	4,343	500	500
`hi_en`				42,964	500	500	12,187	500	500
`hi_gu`				26,335	500	500	4,899	500	500
`hi_kn`				27,531	500	500	16,351	500	500
`hi_ks`	21,103	500	500
`hi_or`				24,291	500	500	9,387	500	500
`hi_pa`				30,373	500	500	11,328	500	500
`hi_sd`	21,548	500	500				13,233	500	500
`hi_te`				8,061	500	500	11,911	500	500
`hi_ur`	8,956	500	500	9,929	500	500	5,271	500	500
`kn_hi`				16,040	500	500	19,148	500	500
`ks_ur`	2,606	500	500
`or_hi`				11,581	500	500	18,308	500	500
`pa_hi`				22,098	500	500	22,532	500	500
`sd_hi`	3,499	500	500
`te_en`				5,527	500	500	6,008	500	500
`te_hi`				4,405	500	500	18,246	500	500
`ur_hi`	27,791	500	500	8,938	500	500	6,259	500	500
`ur_ks`	22,820	500	500
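As a rough illustration of how the empty cells in the table above can be handled, the following Python sketch parses one row into per-domain counts. It assumes the row is available as a tab-separated string with the nine count columns in the order shown in the header; the function name `parse_stats_row` and the constants are hypothetical and not part of the dataset.

```python
from typing import Dict, Tuple

# Column order taken from the table header: three domains, three splits each.
DOMAINS = ("General", "Governance", "Health")
SPLITS = ("Train", "Dev", "Test")


def parse_stats_row(row: str) -> Tuple[str, Dict[str, Dict[str, int]]]:
    """Parse one tab-separated table row into (pair, {domain: {split: count}}).

    Assumption: domains whose three cells are all empty are unavailable
    and are left out of the result.
    """
    cells = row.rstrip("\r\n").split("\t")
    pair = cells[0].strip().strip("`")
    counts = cells[1:]
    # Trailing empty cells may be missing from the row, so pad to nine columns.
    counts += [""] * (len(DOMAINS) * len(SPLITS) - len(counts))

    stats: Dict[str, Dict[str, int]] = {}
    for index, domain in enumerate(DOMAINS):
        chunk = counts[index * 3:(index + 1) * 3]
        if all(cell.strip() == "" for cell in chunk):
            continue  # empty cells: no data for this domain
        stats[domain] = {
            split: int(cell.replace(",", ""))  # "21,548" -> 21548
            for split, cell in zip(SPLITS, chunk)
        }
    return pair, stats


# Row copied from the table above (hi_sd has General and Health data only).
pair, stats = parse_stats_row("`hi_sd`\t21,548\t500\t500\t\t\t\t13,233\t500\t500")
print(pair, sorted(stats))
```

Skipping unavailable domains, rather than storing zeros, keeps "no data" distinct from "zero sentences".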

Domains:
1. Governance: Includes sentences from government documents, press releases, and legal texts.
2. Health: Comprises text from medical journals, healthcare advisories, and public health communications.
3. General: A broad category including sentences from news articles, websites, and miscellaneous sources.

Data Format:
Each dataset configuration is provided as a single tab-separated text file (.txt).
Each line in the file represents a parallel sentence pair, with the source language sentence and the target language sentence separated by a single tab character (\t).
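A minimal Python sketch of reading this format is shown below. It assumes UTF-8 encoding and that blank lines can be skipped, neither of which is stated above; the function name `iter_sentence_pairs`, the file name in the comment, and the sample sentences are illustrative only and do not come from the dataset.

```python
from typing import Iterable, Iterator, Tuple


def iter_sentence_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) tuples from tab-separated lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines carry no data
        parts = line.split("\t")
        if len(parts) != 2:
            # The format promises exactly one tab per line; fail loudly otherwise.
            raise ValueError(
                f"line {line_number}: expected 2 fields, found {len(parts)}"
            )
        source, target = parts
        yield source.strip(), target.strip()


# In-memory demo with made-up sentences. With a real file, pass the open
# handle instead, for example:
#   with open("train.txt", encoding="utf-8") as handle:  # hypothetical file name
#       pairs = list(iter_sentence_pairs(handle))
sample = ["यह एक वाक्य है।\tఇది ఒక వాక్యం.\n", "\n", "धन्यवाद\tధన్యవాదాలు\n"]
pairs = list(iter_sentence_pairs(sample))
print(len(pairs))
```

Raising an error on malformed lines, instead of silently dropping them, makes misaligned pairs visible before they reach training.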

Citation:

If you use this dataset in your research, please consider citing it.

@misc{bhattacharjee2025corilenrichingindianlanguage,
title={CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems},
author={Soham Bhattacharjee and Mukund K Roy and Yathish Poojary and Bhargav Dave and Mihir Raj and Vandan Mujadia and Baban Gain and Pruthwik Mishra and Arafat Ahsan and Parameswari Krishnamurthy and Ashwath Rao and Gurpreet Singh Josan and Preeti Dubey and Aadil Amin Kak and Anna Rao Kulkarni and Narendra VG and Sunita Arora and Rakesh Balbantray and Prasenjit Majumdar and Karunesh K Arora and Asif Ekbal and Dipti Mishra Sharma},
year={2025},
eprint={2509.19941},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.19941},
}

Paper Link:
https://arxiv.org/abs/2509.19941