Hindi-Bengali, Hindi-Konkani, Hindi-Maithili, and Hindi-Marathi parallel corpora created as part of the VIDYAAPATI project under Mission Bhashini of the Ministry of Electronics and Information Technology (MEITY), Government of India.
Dataset Description:
This dataset consists of several bilingual parallel corpora, each pairing Hindi sentences with their translations into another Indian language. The corpus has been developed to support research and development in multilingual NLP, particularly Machine Translation and cross-lingual language technologies for West and East Indian languages. The translations were produced by professional human translators, with the aim of ensuring linguistic accuracy, semantic fidelity, and natural expression in the target languages. The resource is intended to facilitate the training, evaluation, and benchmarking of Machine Translation systems and other multilingual AI applications.

Language Pairs Covered:
- Hindi-Bengali
- Hindi-Konkani
- Hindi-Maithili
- Hindi-Marathi

Domain Coverage:
The corpus contains content from a range of domains, including Administration, Agriculture, Climate, Education, Health, Law, Technical, and Tourism, to reflect practical language use across diverse real-world contexts.

Project Context:
This dataset has been created under the VIDYAAPATI project, which operates under Mission Bhashini, an initiative of the Ministry of Electronics and Information Technology (MEITY), Government of India. The project aims to develop Machine Translation systems between Hindi and East and West Indian languages, to support multilingual digital ecosystems and to promote the development of AI systems for Indian languages.
Intended Use:
The dataset is designed to support:
- Development of Machine Translation systems
- Research in multilingual and cross-lingual NLP
- Domain adaptation studies for translation models
- Creation of language technologies and applications for Indian languages

Consortium Institutions:
The VIDYAAPATI project is being carried out through a consortium of academic and research institutions:
- Indian Institute of Technology Bombay (IIT Bombay)
- CDAC Kolkata
- CDAC Pune
- Goa University
- Indian Institute of Technology Patna (IIT Patna)
- Indian Statistical Institute Kolkata (ISI Kolkata)
- Jadavpur University
- Jawaharlal Nehru University (JNU)

For more information, visit the project's GitHub repository at: https://github.com/cfiltnlp/Bhashini-IITB
License: Attribution 4.0 International (CC BY 4.0)
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by National e-Governance Division for AIKosh mission.