

VIDYAAPATI Parallel Corpus

Hindi-Bengali, Hindi-Konkani, Hindi-Maithili, and Hindi-Marathi parallel corpus created as a part of the VIDYAAPATI project under Mission Bhashini of the Ministry of Electronics and Information Technology (MEITY), Government of India.

About Dataset

Dataset Description: This dataset consists of multiple bilingual parallel corpora containing Hindi sentences and their translations into multiple Indian languages. The corpus has been developed to support research and development in multilingual NLP, particularly Machine Translation and cross-lingual language technologies for West and East Indian languages. The translations in the dataset have been produced by professional human translators, ensuring linguistic accuracy, semantic fidelity, and natural expression in the target languages. The resource is intended to facilitate the training, evaluation, and benchmarking of Machine Translation systems and other multilingual AI applications.

Language Pairs Covered:
- Hindi-Bengali
- Hindi-Konkani
- Hindi-Maithili
- Hindi-Marathi

Domain Coverage: The corpus contains content from a range of domains, including Administration, Agriculture, Climate, Education, Health, Law, Technical Tourism, etc., to reflect practical language use across diverse real-world contexts.

Project Context: This dataset has been created under the VIDYAAPATI project, which operates under Mission Bhashini, an initiative of the Ministry of Electronics and Information Technology (MEITY), Government of India. The project aims to develop Machine Translation systems between Hindi and East and West Indian languages to support multilingual digital ecosystems and promote the development of AI systems for Indian languages.

Intended Use: The dataset is designed to support:
- Development of Machine Translation systems
- Research in multilingual and cross-lingual NLP
- Domain adaptation studies for translation models
- Creation of language technologies and applications for Indian languages

Consortium Institutions: The VIDYAAPATI project is being carried out through a consortium of academic and research institutions:
- Indian Institute of Technology Bombay (IIT Bombay)
- CDAC Kolkata
- CDAC Pune
- Goa University
- Indian Institute of Technology Patna (IIT Patna)
- Indian Statistical Institute Kolkata (ISI Kolkata)
- Jadavpur University
- Jawaharlal Nehru University (JNU)

For more information, visit the project's GitHub repository at: https://github.com/cfiltnlp/Bhashini-IITB

Purpose of Dataset

The dataset is designed to support:
- Development of Machine Translation systems
- Research in multilingual and cross-lingual NLP
- Domain adaptation studies for translation models
- Creation of language technologies and applications for Indian languages

Dataset Metadata

License

Attribution 4.0 International (CC BY 4.0)

Geographical coverage

Country

Sector

Science, Technology and Research

Author

VIDYAAPATI Consortium

Source Organisation

Digital India BHASHINI Division

Uploaded by

Sourabh Dattatray Deoghare

Data Quality Score (Beta)

4.5

Dataset type

Structured

Frequency

Time Granularity

Year range

N.A.

Date & Time

11/03/26 15:43:25

Visibility

Open

Hosted / Redirected

Hosted

Data Type

Primary

Data Collection Method

Data selection includes manual collection of high-quality sentence-level segments from copyright-free sources. The data went through sensitive information removal, cleaning, and validation before translation.

Activity Overview

0
6
57.69 MB
130

License Control

Attribution 4.0 International (CC BY 4.0)

VIDYAAPATI-IIT_Bombay_Hindi-Bengali_CDACK.json ( 13.36 MB )



Version Control

Version 1 (57.69 MB)

Sourabh Dattatray Deoghare·4 month(s) ago
- VIDYAAPATI-IIT_Bombay_Hindi-Bengali_CDACK.json
- VIDYAAPATI-IIT_Bombay_Hindi-Konkani_GU.json
- VIDYAAPATI-IIT_Bombay_Hindi-Maithili_JNU.json
- VIDYAAPATI-IIT_Bombay_Hindi-Marathi_CDACP-IITB.json
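
The four files above follow a visible naming pattern: project prefix, language pair, and what appears to be a contributor code. The page does not document the JSON schema, so the following is only a minimal Python sketch: `parse_name` reads the language pair from a file name, and `describe` reports the top-level shape of a downloaded file without assuming any field names. Both function names are hypothetical helpers, and treating the last file-name segment as a contributor code is an assumption.

```python
import json
from pathlib import Path

# File names as listed under Version 1 of the dataset.
FILES = [
    "VIDYAAPATI-IIT_Bombay_Hindi-Bengali_CDACK.json",
    "VIDYAAPATI-IIT_Bombay_Hindi-Konkani_GU.json",
    "VIDYAAPATI-IIT_Bombay_Hindi-Maithili_JNU.json",
    "VIDYAAPATI-IIT_Bombay_Hindi-Marathi_CDACP-IITB.json",
]


def parse_name(filename):
    """Split a corpus file name into source language, target language,
    and the trailing segment (assumed here to be a contributor code)."""
    parts = Path(filename).stem.split("_")
    source, target = parts[-2].split("-", 1)
    return {"source": source, "target": target, "contributor": parts[-1]}


def describe(path):
    """Load one JSON file and summarise its top-level structure.

    The schema is not documented on the dataset page, so this only
    reports what it finds instead of assuming particular keys."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, list):
        return {"type": "list", "length": len(data),
                "first": data[0] if data else None}
    if isinstance(data, dict):
        return {"type": "dict", "keys": list(data)[:10]}
    return {"type": type(data).__name__}


if __name__ == "__main__":
    for name in FILES:
        print(name, parse_name(name))
        # Only inspect files that have actually been downloaded.
        if Path(name).exists():
            print(describe(name))
```

Inspecting the structure first is a deliberate choice: once the actual field names for the Hindi and target-language sentences are known, they can be used to build sentence pairs for training or evaluation.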
