English-Manipuri Parallel Corpus

A bilingual English–Manipuri parallel corpus designed to support machine translation and NLP research for low-resource Indic languages.

About Dataset

The dataset contains aligned English and Manipuri sentence pairs intended for bilingual language processing and translation tasks. It has two parts: the Bible dataset, with approximately 31K parallel sentences, and the PIB-PMI dataset, with approximately 500K parallel sentences.

Citation: T. J. Singh, S. R. Singh and P. Sarmah, "English-Manipuri Machine Translation: An empirical study of different Supervised and Unsupervised Methods," 2021 International Conference on Asian Language Processing (IALP), Singapore, Singapore, 2021, pp. 142-147, doi: 10.1109/IALP54817.2021.9675167.

This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative led by CivicDataLab (CDL), partnering with the Gates Foundation in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).

Purpose of Dataset

This dataset is designed to support research on NLP systems and English-Manipuri machine translation. It can be used for translation model development, multilingual NLP tasks, language learning applications, and the preservation of low-resource Indian languages.
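
As an illustration of how aligned sentence pairs are typically consumed in translation work, the following is a minimal Python sketch. This card does not describe the dataset's file layout, so the sketch assumes the two sides are available as line-aligned sequences of strings (for example, lines read from two text files). The function name `pair_sentences` is hypothetical, and the Manipuri strings in the usage lines are placeholders rather than real data.

```python
from typing import Iterable, Iterator, Tuple


def pair_sentences(
    english: Iterable[str], manipuri: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (english, manipuri) pairs from two line-aligned sequences.

    Assumption: line i on the English side corresponds to line i on the
    Manipuri side. Pairs where either side is empty are skipped.
    """
    en_lines = [line.strip() for line in english]
    mni_lines = [line.strip() for line in manipuri]

    # A length mismatch means the two sides are not aligned line by line.
    if len(en_lines) != len(mni_lines):
        raise ValueError(
            f"Misaligned input: {len(en_lines)} English lines vs "
            f"{len(mni_lines)} Manipuri lines"
        )

    for en, mni in zip(en_lines, mni_lines):
        if en and mni:
            yield en, mni


# Usage with placeholder data (not taken from the dataset).
english_side = ["Hello.\n", "\n", "Thank you.\n"]
manipuri_side = ["<mni sentence 1>\n", "<mni sentence 2>\n", "<mni sentence 3>\n"]
pairs = list(pair_sentences(english_side, manipuri_side))
```

Materializing both sides before pairing keeps the alignment check simple. For the larger PIB-PMI portion, a streaming variant may be preferable if memory is a concern.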