Manipuri Monolingual Corpus

This dataset is an expanded monolingual corpus for Manipuri, compiled from publicly available open-domain texts on the internet. It is designed to support low-resource language processing, NLP research, and the development of multilingual AI systems for Manipuri.

About Dataset

The Manipuri Monolingual Corpus is a large-scale text dataset of monolingual Manipuri language data compiled from publicly available open-domain internet sources. The corpus is intended to support computational research and language technology development for Manipuri, a low-resource Indic language with limited digital linguistic resources.

The dataset is divided into three subsets of varying quality. According to the dataset documentation, Set 1 contains approximately 11 million words, Set 2 approximately 19 million words, and Set 3 approximately 76 million words. Set 1 is of high quality. Set 3 is of low quality because its text has been converted from glyphs. Set 2 falls between the two.

Citation: T. J. Singh, S. R. Singh and P. Sarmah, "English-Manipuri Machine Translation: An empirical study of different Supervised and Unsupervised Methods," 2021 International Conference on Asian Language Processing (IALP), Singapore, Singapore, 2021, pp. 142-147, doi: 10.1109/IALP54817.2021.9675167.

This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative led by CivicDataLab (CDL), partnering with the Gates Foundation in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation, and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).
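The approximate subset sizes quoted in the dataset documentation can be combined to see how the corpus is distributed across quality levels. The following is a minimal sketch in Python; the dictionary keys and the `summarize` helper are illustrative names, not part of the dataset, and the figures are the rounded counts given above rather than exact word counts.

```python
# Approximate word counts per subset, as quoted in the dataset description.
# These are rounded figures, so the derived totals are approximate too.
SET_WORDS = {
    "Set 1 (high quality)": 11_000_000,
    "Set 2 (intermediate quality)": 19_000_000,
    "Set 3 (low quality, converted from glyphs)": 76_000_000,
}


def summarize(set_words):
    """Return the total word count and each subset's percentage share."""
    total = sum(set_words.values())
    shares = {name: 100.0 * count / total for name, count in set_words.items()}
    return total, shares


total_words, shares = summarize(SET_WORDS)
print(f"Total: about {total_words / 1_000_000:.0f} million words")
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of the corpus")
```

Under these rounded figures the three sets add up to roughly 106 million words, with the low-quality Set 3 making up a little over 70% of the total. Anyone who needs cleaner text may therefore want to start from Set 1 and add the other sets selectively.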

Purpose of Dataset

The purpose of this dataset is to support research and development in Manipuri language processing and low-resource language technologies by providing a monolingual corpus for computational and linguistic analysis. The dataset enables the development and training of NLP and machine learning models for tasks such as language modeling, text generation, text classification, and multilingual language understanding. It also supports linguistic research, educational applications, and the preservation of Manipuri language resources by providing digitally accessible textual data for one of the underrepresented languages of the Indian subcontinent.