COMTAIL DATA is a large dataset of human evaluation ratings for machine translation, covering 13 Indian languages and 21 translation directions. The complete dataset is made available.
The sizes of the train, dev, and test splits are shown per language pair, along with totals aggregated over all language pairs.
| srclng | tgtlng | train | dev | test |
|---|---|---|---|---|
| eng | ban | 9823 | 484 | 481 |
| eng | guj | 8232 | 408 | 401 |
| eng | hin | 12164 | 567 | 571 |
| eng | kan | 11212 | 542 | 559 |
| eng | kas | 9773 | 501 | 491 |
| eng | mar | 11853 | 583 | 598 |
| eng | odi | 2023 | 99 | 103 |
| eng | pan | 7240 | 345 | 349 |
| eng | tam | 11719 | 543 | 538 |
| eng | tel | 12127 | 572 | 615 |
| eng | urd | 10639 | 533 | 525 |
| hin | ban | 5468 | 280 | 274 |
| hin | doi | 8726 | 443 | 451 |
| hin | guj | 7684 | 379 | 379 |
| hin | kan | 10555 | 520 | 525 |
| hin | mar | 11826 | 583 | 597 |
| hin | odi | 8950 | 458 | 456 |
| hin | pan | 10522 | 517 | 513 |
| hin | snd | 8868 | 451 | 448 |
| hin | tel | 8190 | 405 | 403 |
| hin | urd | 12143 | 602 | 610 |
| all | all | 199737 | 9815 | 9887 |
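As a quick sanity check, the aggregate row can be recomputed from the per-pair counts. The sketch below simply copies the numbers from the table above into a dictionary (the name `SPLIT_COUNTS` is illustrative and not part of the dataset) and sums each split; it also counts the translation directions and the distinct Indian languages involved.

```python
# Per-pair (train, dev, test) counts copied from the table above.
# SPLIT_COUNTS is an illustrative name, not something shipped with the dataset.
SPLIT_COUNTS = {
    ("eng", "ban"): (9823, 484, 481),
    ("eng", "guj"): (8232, 408, 401),
    ("eng", "hin"): (12164, 567, 571),
    ("eng", "kan"): (11212, 542, 559),
    ("eng", "kas"): (9773, 501, 491),
    ("eng", "mar"): (11853, 583, 598),
    ("eng", "odi"): (2023, 99, 103),
    ("eng", "pan"): (7240, 345, 349),
    ("eng", "tam"): (11719, 543, 538),
    ("eng", "tel"): (12127, 572, 615),
    ("eng", "urd"): (10639, 533, 525),
    ("hin", "ban"): (5468, 280, 274),
    ("hin", "doi"): (8726, 443, 451),
    ("hin", "guj"): (7684, 379, 379),
    ("hin", "kan"): (10555, 520, 525),
    ("hin", "mar"): (11826, 583, 597),
    ("hin", "odi"): (8950, 458, 456),
    ("hin", "pan"): (10522, 517, 513),
    ("hin", "snd"): (8868, 451, 448),
    ("hin", "tel"): (8190, 405, 403),
    ("hin", "urd"): (12143, 602, 610),
}

# Sum each split column over all language pairs.
totals = tuple(sum(counts[i] for counts in SPLIT_COUNTS.values()) for i in range(3))

# Every language code that appears as source or target, excluding English.
indian_languages = {lang for pair in SPLIT_COUNTS for lang in pair} - {"eng"}

print(totals)                 # (199737, 9815, 9887)
print(len(SPLIT_COUNTS))      # 21
print(len(indian_languages))  # 13
```

The recomputed totals match the `all | all` row, and the pair and language counts agree with the 21 directions and 13 Indian languages stated in the description.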
Each file contains the following columns:
- srclng → Three-letter code of the source language (e.g., eng = English, hin = Hindi).
- tgtlng → Three-letter code of the target language (e.g., ban = Bangla, urd = Urdu).
- src → Original sentence in the source language.
- mt → Translation hypothesis in the target language.
- ref → Human reference translation in the target language.
- score → Standardized quality score assigned to the hypothesis as a DA+SQM (Direct Assessment + Scalar Quality Metric) rating.
- origin → Identifier of the source of the translation hypothesis (e.g., SeamlessRPT).
- domain → Domain label (fine-grained).
- bucket → Length bucket of the source sentence, in words.
| srclng | tgtlng | src | mt | ref | score | origin | domain | bucket |
|---|---|---|---|---|---|---|---|---|
| eng | hin | Don't let them know what you're up to. | उन्हें यह न बताने दो कि तुम क्या कर रहे हो. | उन्हें यह न बताएँ कि आप क्या कर रहे हैं। | 0.4216 | SeamlessRPT | saman | 10.0 |
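The column layout above can be read with Python's standard `csv` module. The following is a minimal sketch, not the dataset's official loader: it assumes the files are tab-separated with a header row (the actual file names and delimiter are not stated here, so adjust `delimiter` as needed), and the `Rating`, `read_ratings`, and `by_pair` names are hypothetical helpers introduced for illustration. The example row above stands in for a real file.

```python
import csv
import io
from dataclasses import dataclass

# Column names as listed in the description above.
COLUMNS = ["srclng", "tgtlng", "src", "mt", "ref", "score", "origin", "domain", "bucket"]


@dataclass
class Rating:
    srclng: str
    tgtlng: str
    src: str
    mt: str
    ref: str
    score: float   # standardized DA+SQM quality score
    origin: str
    domain: str
    bucket: float  # source length bucket, in words


def read_ratings(handle, delimiter="\t"):
    """Parse rows from an open text handle. ASSUMPTION: delimited text with a header row."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    ratings = []
    for row in reader:
        fields = {name: row[name] for name in COLUMNS}
        # Convert the two numeric columns; everything else stays a string.
        fields["score"] = float(fields["score"])
        fields["bucket"] = float(fields["bucket"])
        ratings.append(Rating(**fields))
    return ratings


def by_pair(ratings, srclng, tgtlng):
    """Keep only the ratings for one translation direction."""
    return [r for r in ratings if r.srclng == srclng and r.tgtlng == tgtlng]


# The example row from the table above, used in place of a real file.
sample = "\t".join(COLUMNS) + "\n" + "\t".join([
    "eng", "hin",
    "Don't let them know what you're up to.",
    "उन्हें यह न बताने दो कि तुम क्या कर रहे हो.",
    "उन्हें यह न बताएँ कि आप क्या कर रहे हैं।",
    "0.4216", "SeamlessRPT", "saman", "10.0",
]) + "\n"

ratings = read_ratings(io.StringIO(sample))
eng_hin = by_pair(ratings, "eng", "hin")
print(eng_hin[0].score, eng_hin[0].origin)
```

For a file on disk, the same function can be called as `read_ratings(open(path, encoding="utf-8", newline=""))`, where `path` is whichever split file you downloaded.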
License: Attribution 4.0 International (CC BY 4.0)