VIDYAAPATI Parallel Corpus

Hindi-Bengali, Hindi-Konkani, Hindi-Maithili, and Hindi-Marathi parallel corpus created as a part of the VIDYAAPATI project under Mission Bhashini of the Ministry of Electronics and Information Technology (MEITY), Government of India.

About Dataset

Dataset Description: This dataset consists of multiple bilingual parallel corpora containing Hindi sentences and their translations into multiple Indian languages. The corpus has been developed to support research and development in multilingual NLP, particularly Machine Translation and cross-lingual language technologies for West and East Indian languages. The translations in the dataset have been produced by professional human translators, ensuring linguistic accuracy, semantic fidelity, and natural expression in the target languages. The resource is intended to facilitate the training, evaluation, and benchmarking of Machine Translation systems and other multilingual AI applications.

Language Pairs Covered:
- Hindi-Bengali
- Hindi-Konkani
- Hindi-Maithili
- Hindi-Marathi

Domain Coverage: The corpus contains content from a range of domains, including Administration, Agriculture, Climate, Education, Health, Law, Technical Tourism, etc., to reflect practical language use across diverse real-world contexts.

Project Context: This dataset has been created under the VIDYAAPATI project, which operates under Mission Bhashini, an initiative of the Ministry of Electronics and Information Technology (MEITY), Government of India. The project aims to develop Machine Translation systems between Hindi, and East and West Indian languages to support multilingual digital ecosystems and promote the development of AI systems for Indian languages.

Intended Use: The dataset is designed to support:
- Development of Machine Translation systems
- Research in multilingual and cross-lingual NLP
- Domain adaptation studies for translation models
- Creation of language technologies and applications for Indian languages

Consortium Institutions: The VIDYAAPATI project is being carried out through a consortium of academic and research institutions:
- Indian Institute of Technology Bombay (IIT Bombay)
- CDAC Kolkata
- CDAC Pune
- Goa University
- Indian Institute of Technology Patna (IIT Patna)
- Indian Statistical Institute Kolkata (ISI Kolkata)
- Jadavpur University
- Jawaharlal Nehru University (JNU)

For more information, visit the project's GitHub repository at: https://github.com/cfiltnlp/Bhashini-IITB

Purpose of Dataset

The dataset is designed to support:
- Development of Machine Translation systems
- Research in multilingual and cross-lingual NLP
- Domain adaptation studies for translation models
- Creation of language technologies and applications for Indian languages

Activity Overview

  • Downloads 0
  • Views 10
  • File Size 57.69 MB

Tags

  • Parallel Corpus
  • Bhashini
  • NMT
  • IITB
  • IITBombay
  • parallel sentences
  • language:mai
  • language:hin
  • language:mar
  • language:kok
  • language:ben

License Control

Attribution 4.0 International (CC BY 4.0)

VIDYAAPATI-IIT_Bombay_Hindi-Bengali_CDACK.json (13.36 MB)

Version Control

Version 1 (57.69 MB)
  • Sourabh Dattatray Deoghare · 4 day(s) ago
    • application/json
      VIDYAAPATI-IIT_Bombay_Hindi-Bengali_CDACK.json
    • application/json
      VIDYAAPATI-IIT_Bombay_Hindi-Konkani_GU.json
    • application/json
      VIDYAAPATI-IIT_Bombay_Hindi-Maithili_JNU.json
    • application/json
      VIDYAAPATI-IIT_Bombay_Hindi-Marathi_CDACP-IITB.json
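
This page does not document the internal layout of the JSON files, so a reasonable first step after downloading one is to inspect its structure before writing any loader. The following is a minimal Python sketch under that assumption: `describe_json` is a hypothetical helper name, it uses only the standard library, and it deliberately assumes no field names.

```python
import json
from pathlib import Path


def describe_json(path):
    """Summarize the top-level structure of a JSON file.

    Hypothetical helper: the corpus schema is not documented on this page,
    so nothing here assumes particular field names.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)

    summary = {
        "type": type(data).__name__,
        # Number of top-level items, when the top level is a container.
        "size": len(data) if isinstance(data, (list, dict)) else None,
    }

    # Peek at the first record, whether the top level is a list or a dict.
    if isinstance(data, list) and data:
        first = data[0]
    elif isinstance(data, dict) and data:
        first = next(iter(data.values()))
    else:
        first = None

    # Report the record's keys only if it turns out to be an object.
    summary["first_record_keys"] = (
        sorted(first.keys()) if isinstance(first, dict) else None
    )
    return summary
```

Calling `describe_json("VIDYAAPATI-IIT_Bombay_Hindi-Bengali_CDACK.json")` on a downloaded copy (the path is wherever the file was saved) would show whether the file is a list or an object and which keys its first record carries. That information is what a sentence-pair loader would then be built around.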