Indic Parallel Corpus: 11 Indian Language Pairs for Machine Translation
This dataset contains a parallel corpus for machine translation across 11 Indian language pairs. The data is curated to cover three distinct domains: Governance, Health, and General. It is designed to help researchers and developers build and evaluate robust machine translation models for Indian languages.
Description:
The corpus provides parallel sentences for a variety of language pairs, with a focus on Hindi as a pivot language. All translation pairs are bidirectional. The data has been sourced and cleaned to be useful for training Neural Machine Translation (NMT) models.
The dataset includes the following 11 language pairs:
| Source Language | Target Language | Language Codes |
|---|---|---|
| Hindi | Gujarati | hi - gu |
| Hindi | Kashmiri | hi - ks |
| Hindi | Telugu | hi - te |
| Hindi | Kannada | hi - kn |
| Hindi | Punjabi | hi - pa |
| Hindi | Oriya | hi - or |
| Hindi | Urdu | hi - ur |
| Hindi | Sindhi | hi - sd |
| Hindi | Dogri | hi - doi |
| English | Hindi | en - hi |
| Telugu | English | te - en |
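
Since the description states that all pairs are bidirectional, each code pair in the table corresponds to two translation directions. The following is a minimal sketch, assuming the underscore-joined identifiers used in the statistics table (for example `hi_gu` and `gu_hi`). The `PAIRS` list and the `directions` helper are hypothetical names introduced for illustration. Note that the statistics table does not list both directions for every pair and appears to write Dogri as `dg` rather than `doi`, so the generated identifiers should be checked against the actual data before use.

```python
# Language-code pairs copied from the "Language Codes" column above.
PAIRS = [
    "hi - gu", "hi - ks", "hi - te", "hi - kn", "hi - pa", "hi - or",
    "hi - ur", "hi - sd", "hi - doi", "en - hi", "te - en",
]


def directions(pair):
    """Return both direction identifiers for a pair such as "hi - gu".

    Assumption: identifiers are the two codes joined by an underscore,
    as in the statistics table (e.g. "hi_gu").
    """
    src, tgt = (code.strip() for code in pair.split("-"))
    return f"{src}_{tgt}", f"{tgt}_{src}"


if __name__ == "__main__":
    for pair in PAIRS:
        forward, backward = directions(pair)
        print(f"{pair:>9}  ->  {forward}, {backward}")
```
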

The table below gives the split sizes for each translation direction and domain. Blank cells mean no figure is listed for that domain.

| Language Pair | General (Train) | General (Dev) | General (Test) | Governance (Train) | Governance (Dev) | Governance (Test) | Health (Train) | Health (Dev) | Health (Test) |
|---|---|---|---|---|---|---|---|---|---|
| dg_hi | 12,411 | 500 | 500 | 6,947 | 500 | 500 | | | |
| en_hi | 38,790 | 500 | 500 | 10,043 | 500 | 500 | | | |
| en_te | 9,976 | 500 | 500 | 17,237 | 500 | 500 | | | |
| gu_hi | 18,850 | 500 | 500 | | | | | | |
| hi_dg | 30,359 | 500 | 500 | 4,343 | 500 | 500 | | | |
| hi_en | 42,964 | 500 | 500 | 12,187 | 500 | 500 | | | |
| hi_gu | 26,335 | 500 | 500 | 4,899 | 500 | 500 | | | |
| hi_kn | 27,531 | 500 | 500 | 16,351 | 500 | 500 | | | |
| hi_ks | 21,103 | 500 | 500 | | | | | | |
| hi_or | 24,291 | 500 | 500 | 9,387 | 500 | 500 | | | |
| hi_pa | 30,373 | 500 | 500 | 11,328 | 500 | 500 | | | |
| hi_sd | 21,548 | 500 | 500 | 13,233 | 500 | 500 | | | |
| hi_te | 8,061 | 500 | 500 | 11,911 | 500 | 500 | | | |
| hi_ur | 8,956 | 500 | 500 | 9,929 | 500 | 500 | 5,271 | 500 | 500 |
| kn_hi | 16,040 | 500 | 500 | 19,148 | 500 | 500 | | | |
| ks_ur | 2,606 | 500 | 500 | | | | | | |
| or_hi | 11,581 | 500 | 500 | 18,308 | 500 | 500 | | | |
| pa_hi | 22,098 | 500 | 500 | 22,532 | 500 | 500 | | | |
| sd_hi | 3,499 | 500 | 500 | | | | | | |
| te_en | 5,527 | 500 | 500 | 6,008 | 500 | 500 | | | |
| te_hi | 4,405 | 500 | 500 | 18,246 | 500 | 500 | | | |
| ur_hi | 27,791 | 500 | 500 | 8,938 | 500 | 500 | 6,259 | 500 | 500 |
| ur_ks | 22,820 | 500 | 500 | | | | | | |

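
The statistics can be used to estimate how much training data each direction offers and how it is spread across domains, which helps when deciding how to mix domains during training. The sketch below is illustrative only: it copies the training sizes for three directions from the statistics table, and it assumes that the values in each row fill the General, Governance, and Health columns from left to right. The names `TRAIN_SIZES`, `total_train`, and `domain_share` are hypothetical helpers, not part of the dataset.

```python
# Training-set sizes copied from the statistics table (three directions only).
# Assumption: row values fill the General, Governance, Health columns
# from left to right; a missing key means no figure is listed.
TRAIN_SIZES = {
    "hi_en": {"General": 42_964, "Governance": 12_187},
    "hi_ur": {"General": 8_956, "Governance": 9_929, "Health": 5_271},
    "ur_hi": {"General": 27_791, "Governance": 8_938, "Health": 6_259},
}

# Every dev and test split listed in the table has 500 entries.
DEV_SIZE = TEST_SIZE = 500


def total_train(direction):
    """Sum the training sizes over all listed domains for one direction."""
    return sum(TRAIN_SIZES[direction].values())


def domain_share(direction):
    """Return each domain's fraction of the direction's training data."""
    total = total_train(direction)
    return {domain: n / total for domain, n in TRAIN_SIZES[direction].items()}


if __name__ == "__main__":
    for direction in TRAIN_SIZES:
        shares = ", ".join(
            f"{domain} {share:.1%}"
            for domain, share in domain_share(direction).items()
        )
        print(f"{direction}: {total_train(direction):,} train ({shares})")
```

Extending `TRAIN_SIZES` with the remaining rows gives the same summary for every direction.
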
If you use this dataset in your research, please consider citing it.
@misc{bhattacharjee2025corilenrichingindianlanguage,

License: Attribution 4.0 International (CC BY 4.0)