Government Of India
Samanantar - Largest Parallel Corpus for Indic Languages

Samanantar is the largest publicly available parallel corpus for 11 Indic languages, containing 49.6 million English-to-Indic sentence pairs. It is designed for machine translation and cross-lingual NLP research.

About Dataset

The Samanantar dataset pairs English with 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil and Telugu. Its 49.6 million sentence pairs make it a resource for machine translation, cross-lingual NLP, and multilingual language modeling. The parallel text is drawn from multiple domains, which broadens its linguistic coverage. Each data instance includes an indexed ID, the source text in English, the target text in one of the Indic languages, and the data source. The dataset is used for training and evaluating machine translation models and for benchmarking cross-lingual tasks.
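
The four-part data instance described above can be sketched in Python as follows. This is a minimal illustration only: the field names (`idx`, `src`, `tgt`, `data_source`), the `ParallelPair` class, and the `parse_pair` helper are all hypothetical, and the actual column names in the released files may differ.

```python
from dataclasses import dataclass


# Hypothetical field names; the dataset's real column names may differ.
@dataclass(frozen=True)
class ParallelPair:
    idx: int          # indexed ID
    src: str          # source text in English
    tgt: str          # target text in one of the Indic languages
    data_source: str  # where the sentence pair was collected from


def parse_pair(record: dict) -> ParallelPair:
    """Build a ParallelPair from a dict, rejecting empty source or target text."""
    src = str(record["src"]).strip()
    tgt = str(record["tgt"]).strip()
    if not src or not tgt:
        raise ValueError("both source and target text must be non-empty")
    return ParallelPair(int(record["idx"]), src, tgt, str(record["data_source"]))


# Illustrative record, not taken from the dataset.
example = parse_pair(
    {"idx": "0", "src": " How are you? ", "tgt": "आप कैसे हैं?", "data_source": "illustrative"}
)
```

Validating at parse time keeps empty or whitespace-only sides out of a training set, which is a common precaution with web-mined parallel text.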

Activity Overview

  • Downloads 0
  • Redirect 103
  • Views 1,100
  • File Size 0

Tags

  • Machine Translation
  • Parallel Corpus
  • Multilingual Dataset
  • NLP
  • Indic Languages
  • English-Indic translation
  • bilingual dataset
  • cross-lingual NLP

License Control

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)