
A structured questionnaire resource designed for eliciting narration-based speech data in low-resource languages through guided prompts, translation tasks, and stimuli-based linguistic data collection methods under the SpeeD-TB project
This dataset consists of a structured questionnaire developed under the SpeeD-TB project for eliciting narration-based speech data in Indian languages. The questionnaire is designed to support systematic collection of spoken language data through guided narration tasks, enabling speakers to produce natural, contextual speech samples for linguistic and computational research. The questionnaires cover several domains and are adapted for different language communities. The dataset covers 7 languages: Bodo, Chokri, English, Hindi, Kokborok, Meitei, and Toto. The domains covered are agriculture, culture, education, general oral history, healthcare, lifestyle, science and technology, and sports. Corresponding audio files are included for every language except English, Hindi, and Toto, and the questionnaire is parallel across all languages. This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative, led by CivicDataLab (CDL) in partnership with the Gates Foundation and in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation, and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).
The purpose of this dataset is to support structured elicitation and collection of narration-based speech data for low-resource Indian languages. It can be used for speech corpus development, language documentation, speech technology research, automatic speech recognition, multilingual NLP, linguistic analysis, low-resource language technology development, and preservation of linguistic and cultural heritage.
GNU General Public License, version 3
© 2026 - Copyright AIKosh. All rights reserved.