Assamese-Bodo, English-Assamese, English-Bodo, English-Manipuri, English-Nepali, and Hindi-Manipuri sentence-level parallel corpus created as a part of the ISHAAN project under Mission Bhashini of the Ministry of Electronics and Information Technology (MEITY), Government of India.
Dataset Description:
This dataset consists of multiple bilingual parallel corpora containing English/Hindi sentences and their translations into multiple North-East Indian languages. The corpus has been developed to support research and development in multilingual NLP, particularly Machine Translation and cross-lingual language technologies for North-East Indian languages. The translations have been produced by professional human translators, with the aim of ensuring linguistic accuracy, semantic fidelity, and natural expression in the target languages. The resource is intended to facilitate the training, evaluation, and benchmarking of Machine Translation systems and other multilingual AI applications.

Language Pairs Covered:
- Assamese-Bodo
- English-Assamese
- English-Bodo
- English-Manipuri
- English-Nepali
- Hindi-Manipuri

Domain Coverage:
The corpus contains content from a range of domains to reflect practical language use across diverse real-world contexts. The primary domains represented in the dataset include:
- Administration
- Agriculture
- Climate
- Education
- Health
- Law
- Technical
- Tourism

Project Context:
This dataset has been created under the ISHAAN project, which operates under Mission Bhashini, an initiative of the Ministry of Electronics and Information Technology (MEITY), Government of India. The project aims to develop Machine Translation systems between English, Hindi, and North-East Indian languages to support multilingual digital ecosystems and promote the development of AI systems for Indian languages.
Intended Use:
The dataset is designed to support:
- Development of Machine Translation systems
- Research in multilingual and cross-lingual NLP
- Domain adaptation studies for translation models
- Creation of language technologies and applications for Indian languages

Consortium Institutions:
The ISHAAN project is being carried out through a consortium of academic and research institutions:
- Indian Institute of Technology Bombay (IIT Bombay)
- Indian Institute of Information Technology Manipur (IIIT Manipur)
- International Institute of Information Technology Hyderabad (IIIT Hyderabad)
- Gauhati University
- National Institute of Technology Meghalaya (NIT Meghalaya)
- University of North Bengal

For more information, visit the project's GitHub repository at: https://github.com/cfiltnlp/Bhashini-IITB
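Example (illustrative only):
As a rough illustration of the Machine Translation use case listed under Intended Use, the sketch below reads sentence pairs and sets aside a held-out portion for evaluation. The file layout is an assumption: this description does not state how the corpus files are formatted, so the example assumes one sentence pair per line, with source and target separated by a tab. The function names `read_pairs` and `split_pairs` and the sample strings are hypothetical and should be adapted to the actual files.

```python
import csv
import io
import random


def read_pairs(handle):
    """Read (source, target) pairs from a tab-separated text stream.

    ASSUMPTION: one pair per line, two tab-separated fields. The real
    corpus files may use a different layout.
    """
    pairs = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        src, tgt = row[0].strip(), row[1].strip()
        if src and tgt:  # drop pairs with an empty side
            pairs.append((src, tgt))
    return pairs


def split_pairs(pairs, eval_fraction=0.1, seed=13):
    """Shuffle reproducibly and return (train, held_out) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction)) if shuffled else 0
    return shuffled[n_eval:], shuffled[:n_eval]


# Placeholder data standing in for a real corpus file (not actual translations).
sample = io.StringIO(
    "Source sentence 1\t<target sentence 1>\n"
    "Source sentence 2\t<target sentence 2>\n"
    "Source sentence 3\t<target sentence 3>\n"
)
train, held_out = split_pairs(read_pairs(sample), eval_fraction=0.34)
print(len(train), "training pairs;", len(held_out), "held-out pairs")
```

With a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf-8", newline="")`, where `path` points to one of the downloaded language-pair files.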
License: Attribution 4.0 International (CC BY 4.0)