SpeeD-TB-Toto

An approximately 200-hour transcribed speech dataset of Toto, an endangered Tibeto-Burman language spoken in Totopara village, West Bengal, developed under the SpeeD-TB Project.

About Dataset

The Toto Speech Dataset, developed as part of Speech Datasets and Models for Tibeto-Burman Languages (Project SpeeD-TB) and funded under Mission Bhashini, is a transcribed speech corpus of Toto, an under-resourced and critically endangered language with fewer than 1,000 speakers, spoken by a small community in Totopara village, Alipurduar district, West Bengal, India. The dataset comprises approximately 200 hours of high-quality audio recordings paired with transcriptions in both IPA and Bengali script, making it a valuable resource for building and evaluating voice models in low-resource linguistic settings.
The audio captures a diverse range of speakers across age groups, genders, and education levels, ensuring variability in pronunciation, speech patterns, and tone. It includes both spontaneous and read speech collected in naturalistic and semi-controlled environments, reflecting real-world linguistic usage. The transcriptions are carefully prepared and normalised for consistency, which supports robust model training. The dataset also contributes to the preservation and digital documentation of the Toto language by transforming oral knowledge into structured, machine-readable formats.
Almost 60% of the data in the corpus comes from the domains of agriculture, education, and science & technology. The rest comes from varied domains including culture, lifecycle, sports, entertainment, healthcare, and oral history, giving the corpus broad coverage. A variety of elicitation methods were used to collect the data, including translations, narrations, lectures, role-play, spontaneous conversations, interviews, and picture and video descriptions. Each recording in the released dataset is mapped to rich metadata, including demographic and linguistic metadata of the speakers, domains, elicitation methods, and the individual prompts.
The audio in the current dataset is already segmented at sentence level, so it can be integrated into a model training pipeline out of the box.
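Because each sentence-level clip is mapped to metadata such as domain and elicitation method, a quick check of how the audio is distributed across domains can be sketched as below. This is a minimal illustration only: the manifest format (a CSV file), the column names (audio_path, domain, elicitation_method, speaker_id, duration_sec), and the sample rows are all assumptions made for the example, not the actual layout or contents of the release.

```python
import csv
import io
from collections import defaultdict


def hours_by_domain(manifest_file):
    """Sum clip durations per domain and return the totals in hours.

    Assumes a CSV manifest with hypothetical columns 'domain' and
    'duration_sec'; adapt the names to the real metadata files.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(manifest_file):
        totals[row["domain"]] += float(row["duration_sec"]) / 3600.0
    return dict(totals)


# Made-up rows for illustration only; these are not real corpus entries.
SAMPLE = """audio_path,domain,elicitation_method,speaker_id,duration_sec
clips/0001.wav,agriculture,narration,spk01,1800
clips/0002.wav,education,interview,spk02,900
clips/0003.wav,agriculture,translation,spk01,900
"""

totals = hours_by_domain(io.StringIO(SAMPLE))
for domain, hours in sorted(totals.items()):
    print(f"{domain}: {hours:.2f} h")
```

The same grouping could be applied to any other metadata field, such as elicitation method or speaker, by changing the key used in the loop.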

Purpose of Dataset

The speech corpus is primarily developed for building voice AI systems for the Toto language. It is also expected to be useful for researchers working in linguistics, anthropology, and language technologies. For a language like Toto, the dataset is also expected to prove useful for language preservation and standardisation, while enabling the development of inclusive technologies such as voice assistants, speech-to-text systems, and pedagogical and language learning tools for the community.