
A speech and transcription dataset for the Kokborok language, containing raw and transcribed audio collected through multiple elicitation and recording methods to support technology, NLP research, and language preservation.
This dataset is a collection of raw and transcribed speech data for the Kokborok language, curated under the SpeeD-TB project for Tibeto-Burman Indian languages. The data are collected from approximately 80-100 speakers, mostly in the 20-50 age group, and mostly cover the education, agriculture, and science and technology domains. The dataset contains approximately 3.5 hours of audio recordings with corresponding transcriptions. It contributes to digital language preservation and computational resource development for Kokborok, an important Tibeto-Burman language spoken in India. This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative, led by CivicDataLab (CDL) in partnership with the Gates Foundation and in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation, and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).
The purpose of this dataset is to support the development of speech technologies and natural language processing tools for the Kokborok language. It can be used for automatic speech recognition, speech-to-text modeling, multilingual AI systems, linguistic analysis, language modeling, transcription research, low-resource language technology development, and preservation of linguistic and cultural heritage associated with the Kokborok language.
Attribution 4.0 International (CC BY 4.0)
1409 files
284.19 KB
© 2026 - Copyright AIKosh. All rights reserved.