Government Of India
English to Indian languages parallel dataset

Human-translated bidirectional parallel corpus between English and six Indian languages

About Dataset

This dataset presents a high-quality human-translated parallel corpus comprising English sentences aligned with their translations into six major Indian languages. The resource has been meticulously curated to support research and development in machine translation.

The parallel corpus is domain-balanced, with content systematically selected and translated across five key domains that reflect real-world linguistic diversity and practical relevance:

1. Governance and Policy (Primary Domain): Texts from administrative communications, government schemes, policies, and citizen-centric materials.

2. Science and Technology: Passages covering emerging technologies, innovation, and scientific awareness content.

3. Education: Educational materials, curriculum-based texts, and pedagogical content.

4. Health: Public health information, medical advisories, and awareness literature.

5. Agriculture: Farmer outreach, crop management, and rural development-related texts.

All translations have been performed and validated by professional human translators, ensuring high linguistic fidelity, semantic equivalence, and domain consistency across languages. Rigorous quality checks were followed to maintain alignment accuracy and contextual relevance.

This corpus has been developed under the EILMT (English to Indian Languages Machine Translation) consortium, operating within the framework of the Mission Bhashini initiative of the Government of India. The effort aligns with the national vision of enabling language inclusivity, improving the accessibility of digital content, and fostering multilingual AI technologies.

The dataset serves as a benchmark resource for building and evaluating translation systems, for domain adaptation studies, and for developing linguistic resources for Indian languages, thereby contributing to the broader goals of linguistic empowerment and digital inclusivity in India.
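
As an illustration of how a sentence-aligned parallel corpus is typically consumed when building or evaluating a translation system, the following is a minimal sketch. It assumes each language pair is stored as two UTF-8 text files with one sentence per line, aligned by line number. This page does not document the file format, so that layout, the function names `pair_sentences` and `read_parallel`, and any file paths passed to them are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def pair_sentences(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> List[Tuple[str, str]]:
    """Pair source and target sentences by position.

    Assumption (not confirmed by the dataset page): line i of the source
    corresponds to line i of the target.
    """
    src = [line.strip() for line in src_lines]
    tgt = [line.strip() for line in tgt_lines]
    if len(src) != len(tgt):
        # A length mismatch means the two sides cannot be aligned line by line.
        raise ValueError(
            f"misaligned corpus: {len(src)} source vs {len(tgt)} target lines"
        )
    # Skip pairs where either side is empty after stripping whitespace.
    return [(s, t) for s, t in zip(src, tgt) if s and t]


def read_parallel(src_path: Path, tgt_path: Path) -> List[Tuple[str, str]]:
    """Read two line-aligned UTF-8 files (hypothetical layout) into pairs."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        return pair_sentences(fs, ft)
```

Checking the line counts before pairing is a cheap guard against silently shifted alignments, which would otherwise corrupt every pair after the first missing line.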

Consortia Members
The project is being carried out in collaboration with the following institutions:

Centre for Development of Advanced Computing (C-DAC), Noida
Centre for Development of Advanced Computing (C-DAC), Pune
Indian Institute of Technology (IIT) Bombay
IIIT Hyderabad
AU-KBC, Anna University Chennai
Banasthali Vidyapith
C-DAC Bengaluru
C-DAC Trivandrum
Dharmsinh Desai University, Gujarat
IIIT Bhubaneswar, Odisha

For more information, visit the project's GitHub repository at: https://github.com/eilmt/NLTM-EILMT

Purpose of Dataset

To support research and development in machine translation technology.

Activity Overview

  • Downloads 2
  • Views 48
  • File Size 80.90 MB

Tags

  • Parallel Corpus
  • license:cc-by-4.0

License Control

Attribution 4.0 International (CC BY 4.0)

English-Gujarati Data (1 directory)

Directory: English-Gujarati_Data (11 files)

Version Control

Version 1 (80.90 MB)
  • Mukund Kumar Roy · 1 month(s) ago
    • English-Gujarati Data
      • English-Gujarati_Data
    • English-Hindi Data
    • English-Kannada Data
    • English-Malayalam Data
    • English-Marathi Data
    • English-Odia Data
    • Gujarati-English Data
    • Hindi-English Data
    • Kannada-English Data
    • Malayalam-English Data
    • 2 more