ORGANISATION

ISHAAN Parallel Corpus

Assamese-Bodo, English-Assamese, English-Bodo, English-Manipuri, English-Nepali, and Hindi-Manipuri sentence-level parallel corpus created as a part of the ISHAAN project under Mission Bhashini of the Ministry of Electronics and Information Technology (MEITY), Government of India.

About Dataset

Dataset Description: This dataset consists of multiple bilingual parallel corpora containing English/Hindi sentences and their translations into multiple North-East Indian languages. The corpus has been developed to support research and development in multilingual NLP, particularly Machine Translation and cross-lingual language technologies for North-East Indian languages. The translations have been produced by professional human translators to ensure linguistic accuracy, semantic fidelity, and natural expression in the target languages. The resource is intended to facilitate the training, evaluation, and benchmarking of Machine Translation systems and other multilingual AI applications.

Language Pairs Covered:
- Assamese-Bodo
- English-Assamese
- English-Bodo
- English-Manipuri
- English-Nepali
- Hindi-Manipuri

Domain Coverage: The corpus contains content from a range of domains, including Administration, Agriculture, Climate, Education, Health, Law, Technical, Tourism, etc., to reflect practical language use across diverse real-world contexts.

Project Context: This dataset has been created under the ISHAAN project, which operates under Mission Bhashini, an initiative of the Ministry of Electronics and Information Technology (MEITY), Government of India. The project aims to develop Machine Translation systems between English, Hindi, and North-East Indian languages to support multilingual digital ecosystems and promote the development of AI systems for Indian languages.

Intended Use: The dataset is designed to support:
- Development of Machine Translation systems
- Research in multilingual and cross-lingual NLP
- Domain adaptation studies for translation models
- Creation of language technologies and applications for Indian languages

Consortium Institutions: The ISHAAN project is being carried out through a consortium of academic and research institutions:
- Indian Institute of Technology Bombay (IIT Bombay)
- Indian Institute of Information Technology Manipur (IIIT Manipur)
- International Institute of Information Technology Hyderabad (IIIT Hyderabad)
- Gauhati University
- National Institute of Technology Meghalaya (NIT Meghalaya)
- University of North Bengal

For more information, visit the project's GitHub repository at: https://github.com/cfiltnlp/Bhashini-IITB

Purpose of Dataset

The dataset is designed to support:
- Development of Machine Translation systems
- Research in multilingual and cross-lingual NLP
- Domain adaptation studies for translation models
- Creation of language technologies and applications for Indian languages
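Since the corpus is meant for training, evaluating, and benchmarking translation systems, users will typically need to hold out part of each language pair. The sketch below shows one possible way to do that deterministically. It is not part of the ISHAAN release: the function names, the 5% dev and 5% test proportions, and the representation of the data as (source, target) tuples are all assumptions made for illustration.

```python
import hashlib


def assign_split(source_sentence, dev_pct=5, test_pct=5):
    """Map a source sentence to 'train', 'dev', or 'test'.

    Hypothetical helper: the split depends only on a hash of the source
    text, so the same sentence lands in the same split on every run.
    """
    digest = hashlib.sha256(source_sentence.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def split_pairs(pairs):
    """Partition an iterable of (source, target) tuples into three lists."""
    splits = {"train": [], "dev": [], "test": []}
    for source, target in pairs:
        splits[assign_split(source)].append((source, target))
    return splits
```

Hashing the source sentence, rather than shuffling, keeps identical source sentences in the same split, which helps avoid leakage between training and evaluation data when a sentence occurs more than once.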

Dataset Metadata

License

Attribution 4.0 International (CC BY 4.0)

Geographical coverage

Country

Sector

Science, Technology and Research

Author

ISHAAN Consortium

Source Organisation

Digital India BHASHINI Division

Uploaded by

Sourabh Dattatray Deoghare

Data Quality Score (Beta)

4.5

Dataset type

Structured

Frequency

Time Granularity

Year range

N.A.

Date & Time

11/03/26 13:27:02

Visibility

Open

Hosted / Redirected

Hosted

Data Type

Primary

Data Collection Method

Data selection includes manual collection of high-quality sentence-level segments from copyright-free sources. The data went through sensitive-information removal, cleaning, and validation before translation.

Activity Overview

0
8
99.19 MB
135

License Control

Attribution 4.0 International (CC BY 4.0)

ISHAAN-IIT_Bombay_Assamese-Bodo_Gauhati_University.json ( 20.95 MB )

Version Control

Version 1(99.19 MB)

Sourabh Dattatray Deoghare·4 month(s) ago
- ISHAAN-IIT_Bombay_Assamese-Bodo_Gauhati_University.json
- ISHAAN-IIT_Bombay_English-Assamese_Gauhati_University.json
- ISHAAN-IIT_Bombay_English-Bodo_Gauhati_University.json
- ISHAAN-IIT_Bombay_English-Manipuri_IIit_Manipur.json
- ISHAAN-IIT_Bombay_English-Nepali_University_of_North_Bengal.json
- ISHAAN-IIT_Bombay_Hindi-Manipuri_IIIT_Manipur.json
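
The file names listed above follow a visible pattern: the prefix `ISHAAN-IIT_Bombay_`, then the language pair, then the contributing institution. The sketch below is one way to recover the pair and institution from a file name and to take a first look at a downloaded file. The internal JSON layout is not documented on this page, so `describe_json` deliberately assumes nothing about field names and only reports the top-level type and size. Both function names are hypothetical.

```python
import json
import re

# Pattern inferred from the six file names in the version listing.
FILENAME_PATTERN = re.compile(
    r"^ISHAAN-IIT_Bombay_"
    r"(?P<source>[A-Za-z]+)-(?P<target>[A-Za-z]+)_"
    r"(?P<institution>.+)\.json$"
)


def parse_corpus_filename(name):
    """Return source language, target language, and institution for a file name."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError("Unrecognised corpus file name: " + name)
    return {
        "source": match.group("source"),
        "target": match.group("target"),
        "institution": match.group("institution").replace("_", " "),
    }


def describe_json(path):
    """Load a JSON file and report its top-level type and length.

    Schema-agnostic on purpose: inspect the result before writing code
    that depends on specific keys.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size
```

For example, `parse_corpus_filename("ISHAAN-IIT_Bombay_English-Nepali_University_of_North_Bengal.json")` yields the source language `English`, the target language `Nepali`, and the institution `University of North Bengal`.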
