A large-scale monolingual Bodo text corpus, intended for machine translation and NLP research on a low-resource Indian language.
The author does not claim any ownership of the data shared here. Most of the data comes from TDIL, and some texts were crawled using Bodo News Crawlers. All copyrights belong to the respective owners. The corpus is a large collection of text written in the Bodo language, containing nearly 490,000 sentences. It is divided into two parts: a larger training set of around 475k sentences and a smaller test set of about 14.9k sentences. This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative led by CivicDataLab (CDL), partnering with the Gates Foundation in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation, and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).
The purpose of this dataset is to support research and development in Bodo language processing and low-resource language technologies by providing a large-scale monolingual corpus. The dataset enables the development and training of NLP and machine learning models for tasks such as language modeling, text generation, and text classification. It also supports linguistic research, educational applications, and the preservation of Bodo language resources.
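As a starting point for the modeling tasks mentioned above, the following is a minimal Python sketch for loading the corpus and computing basic statistics. It rests on assumptions not stated in this listing: that each split is a UTF-8 plain-text file with one sentence per line, and that whitespace tokenization is an acceptable first approximation for Bodo. The file names `train.txt` and `test.txt` and all function names are hypothetical.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, List


def clean_lines(lines: Iterable[str]) -> List[str]:
    """Strip surrounding whitespace and drop empty lines."""
    return [line.strip() for line in lines if line.strip()]


def load_sentences(path: str) -> List[str]:
    """Read a UTF-8 text file, ASSUMING one sentence per line."""
    text = Path(path).read_text(encoding="utf-8")
    return clean_lines(text.splitlines())


def corpus_stats(sentences: List[str]) -> Dict[str, int]:
    """Count sentences, whitespace-separated tokens, and distinct tokens."""
    token_counts = Counter(tok for s in sentences for tok in s.split())
    return {
        "sentences": len(sentences),
        "tokens": sum(token_counts.values()),
        "vocab": len(token_counts),
    }


def held_out_fraction(n_train: int, n_test: int) -> float:
    """Share of all sentences that fall in the test split."""
    return n_test / (n_train + n_test)


# Hypothetical usage (file names are assumptions, not part of this listing):
# train = load_sentences("train.txt")
# test = load_sentences("test.txt")
# print(corpus_stats(train))
# print(held_out_fraction(len(train), len(test)))
```

With the approximate sizes given in the description (about 475k training and 14.9k test sentences), `held_out_fraction(475_000, 14_900)` comes to roughly 3% of the corpus held out for testing.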
MIT
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by National e-Governance Division for AIKosh mission.