MS MARCO dataset translated into various Indic languages
This dataset contains MS MARCO translated into 14 Indic languages. The original MS MARCO dataset is a collection of queries, passages, and answers for machine reading comprehension and question answering tasks. Each example includes both the original English content and the translated content, along with translation metadata. Each language has one training file and one validation file in JSON Lines format, listed below.
| Language Code | Language Name | Train File | Validation File |
|---|---|---|---|
| as | Assamese | asmtrain.jsonl | asmval.jsonl |
| bn | Bengali | bentrain.jsonl | benval.jsonl |
| gu | Gujarati | gutrain.jsonl | guval.jsonl |
| hi | Hindi | hintrain.jsonl | hinval.jsonl |
| kn | Kannada | kantrain.jsonl | kanval.jsonl |
| ml | Malayalam | maltrain.jsonl | malval.jsonl |
| mr | Marathi | martrain.jsonl | marval.jsonl |
| ne | Nepali | neptrain.jsonl | nepval.jsonl |
| or | Odia | ortrain.jsonl | orval.jsonl |
| pa | Punjabi | pantrain.jsonl | panval.jsonl |
| sa | Sanskrit | santrain.jsonl | sanval.jsonl |
| ta | Tamil | tamtrain.jsonl | tamval.jsonl |
| te | Telugu | teltrain.jsonl | telval.jsonl |
| ur | Urdu | urdtrain.jsonl | urdval.jsonl |
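
The file names in the table do not always share a prefix with the language code (for example `gu` and `or` keep two letters, while `hi` becomes `hin`), so a small lookup helps when loading a split. The following is a minimal sketch, not an official loader: the names `FILE_PREFIXES`, `split_path`, `iter_examples`, and the `data_dir` argument are hypothetical, and it assumes the files have been downloaded locally with one JSON object per line. The record fields are not documented here, so each line is parsed into a plain dictionary without assuming any keys.

```python
import json
from pathlib import Path

# File-name prefixes copied from the table above, keyed by language code.
FILE_PREFIXES = {
    "as": "asm", "bn": "ben", "gu": "gu", "hi": "hin", "kn": "kan",
    "ml": "mal", "mr": "mar", "ne": "nep", "or": "or", "pa": "pan",
    "sa": "san", "ta": "tam", "te": "tel", "ur": "urd",
}


def split_path(data_dir, lang_code, split):
    """Build the path to a language's 'train' or 'val' file (hypothetical helper)."""
    if split not in ("train", "val"):
        raise ValueError(f"split must be 'train' or 'val', got {split!r}")
    if lang_code not in FILE_PREFIXES:
        raise KeyError(f"unknown language code: {lang_code!r}")
    return Path(data_dir) / f"{FILE_PREFIXES[lang_code]}{split}.jsonl"


def iter_examples(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For instance, `split_path("data", "hi", "train")` resolves to `data/hintrain.jsonl`, and passing that path to `iter_examples` streams the records one at a time rather than loading the whole file into memory.
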
License: MIT