Samanantar - Largest Parallel Corpus for Indic Languages

Samanantar is the largest publicly available parallel corpus for 11 Indic languages, containing 49.6 million English-to-Indic sentence pairs. It is designed for machine translation and cross-lingual NLP research.

About Dataset

The Samanantar dataset supports translation between English and 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu. Its 49.6 million sentence pairs make it a valuable resource for machine translation, cross-lingual NLP, and multilingual language modeling. The parallel text is drawn from a range of domains, which broadens its linguistic coverage. Each data instance includes an indexed ID, the source text in English, the target text in one of the Indic languages, and the data source. The dataset is widely used for training and evaluating machine translation models and for benchmarking cross-lingual tasks.
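
The record layout described above (an indexed ID, English source text, Indic target text, and a data source) can be sketched as a small Python structure. This is a minimal illustration, not the dataset's official schema: the field names `idx`, `src`, `tgt`, and `data_source`, the helper `count_by_source`, and the sample rows with their `source_a`/`source_b` labels are all assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPair:
    """One English-Indic sentence pair (field names are hypothetical)."""
    idx: int          # indexed ID of the instance
    src: str          # source sentence in English
    tgt: str          # target sentence in one Indic language
    data_source: str  # where the pair was collected from


def count_by_source(pairs):
    """Count how many sentence pairs come from each data source."""
    return Counter(pair.data_source for pair in pairs)


# Invented sample rows for illustration only; not taken from the corpus.
sample = [
    ParallelPair(0, "Good morning.", "सुप्रभात।", "source_a"),
    ParallelPair(1, "Thank you.", "धन्यवाद।", "source_b"),
    ParallelPair(2, "How are you?", "आप कैसे हैं?", "source_a"),
]

counts = count_by_source(sample)
```

Grouping by the data source field in this way is one simple method for checking how the pairs are distributed across origins before training or evaluation.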