RESPIN-S1.0 is the largest publicly available dialect-rich read speech corpus for Indian languages, with over 10,000 hours of validated audio across 38+ dialects of nine languages: Bengali, Bhojpuri, Chhattisgarhi, Hindi, Kannada, Magahi, Maithili, Marathi, and Telugu. Curated for the agriculture and finance domains, it includes rich speaker metadata, phonetic lexicons, and dialect-aware splits to support robust ASR and language research in low-resource, multilingual settings. It is also the first large-scale, publicly available corpus to combine dialectal and domain coverage across these languages, including low-resource ones such as Bhojpuri, Chhattisgarhi, and Magahi.
RESPIN-S1.0: A Dialect-Rich ASR Corpus for Indian Languages

RESPIN-S1.0 is a dialect-rich automatic speech recognition (ASR) corpus in nine Indian languages, developed as part of the RESPIN project at SPIRE Lab, Indian Institute of Science (IISc) Bangalore, India. It comprises over 10,000 hours of read speech in the following languages: Bengali (bn), Bhojpuri (bh), Chhattisgarhi (ch), Hindi (hi), Kannada (kn), Magahi (mg), Maithili (mt), Marathi (mr), and Telugu (te).

This corpus is distinguished by its:
- Rich dialectal and domain diversity in sentence content
- Balanced speaker representation across gender, age, and socio-economic backgrounds
- Inclusion of underrepresented populations, primarily from low-income, agriculturally dependent communities

Data Collection Highlights:
- Sentences were sourced from the agriculture and finance domains.
- Native speakers and domain experts from each pincode region contributed to sentence composition, translation, and validation (both manual and programmatic).
- Over 200,000 utterances were collected using a crowdsourced mobile application.
- Utterances were labeled as clean, semi-noisy, or noisy based on transcription quality.
- The clean slab alone exceeds 10,000 hours.

Additional Resources:
- Speaker metadata, phonetic lexicons, and dialect-aware train/dev/test splits are provided to facilitate reproducible research.
- RESPIN-S1.0 supports work in dialectal ASR, language and dialect identification (LID/DID), and other speech-related research in multilingual, low-resource settings.
Dialectal Coverage per Language (Alphabetically Ordered):

Bengali
- D1: Western (West Medinipore)
- D2: Varendri/Northern (Dinajpur Dakshin, Malda)
- D3: Standard Colloquial (South 24 Parganas)
- D4: Jharkhandi (Purulia)
- D5: Rajbangshi (Jalpaiguri)

Bhojpuri
- D1: Northern (East Champaran, Deoria)
- D2: Western (Varanasi)
- D3: Southern/Standard (Saran)

Chhattisgarhi
- D1: Central (Bilaspur)
- D2: Eastern (Raigarh)
- D3: Budati/Khatahi/Western (Kabirdham)
- D4: Bhandar/Northern (Sarguja)

Hindi
- D1: Hindustani + Malvi + Khadi Boli (Muzaffarnagar, UP)
- D2: Kanauji + Braj Bhasha (Etah, UP)
- D3: Awadhi + Bundeli (Hamirpur, UP)
- D4: Marwari + Dhundhari (Nagaur, Rajasthan)
- D5: Garhwali (Tehri Garhwal, Uttarakhand)

Kannada
- D1: Central (Bellary)
- D2: Coastal/Dakshin (Dakshina Kannada)
- D3: Dharwad/North West (Dharwad)
- D4: Northeastern (Gulbarga)
- D5: Mysore Kannada (Mysuru)

Magahi
- D1: Standard (Patna, Gaya)
- D2: Southern (Lakhisarai)
- D3: Western (Vaishali)
- D4: North Eastern/Surjapuri (Kishanganj)

Maithili
- D1: Bajjika (Samastipur)
- D2: Eastern/Thethi (Madhepura)
- D3: Southern/Standard (Darbhanga)
- D4: Angika (Bhagalpur)

Marathi
- D1: Southern Konkan (Sindhudurg)
- D2: Northern Konkan (Nashik, Dhule)
- D3: Standard (Pune)
- D4: Varhadi (Nagpur)

Telugu
- D1: Coastal/Central (Guntur, Krishna)
- D2: Southern (Chittoor, Anantapur)
- D3: Telangana/Northern (Karimnagar, Nalgonda)
- D4: Uttarandhra/Eastern (Srikakulam, Visakhapatnam)
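
For readers who want to work with the dialect labels programmatically, the coverage list above could be summarized as a small lookup. This is only an illustrative sketch: the names `DIALECT_COUNTS` and `dialect_ids` are hypothetical and not part of the corpus release, and the language codes are the ones used in this description rather than a verified on-disk naming scheme.

```python
# Dialects per language, transcribed from the coverage list above.
# Keys are the language codes used in this description; all names here
# are hypothetical and not part of the RESPIN-S1.0 distribution.
DIALECT_COUNTS = {
    "bn": 5,  # Bengali
    "bh": 3,  # Bhojpuri
    "ch": 4,  # Chhattisgarhi
    "hi": 5,  # Hindi
    "kn": 5,  # Kannada
    "mg": 4,  # Magahi
    "mt": 4,  # Maithili
    "mr": 4,  # Marathi
    "te": 4,  # Telugu
}


def dialect_ids(lang):
    """Return the dialect labels D1..Dn listed for a language code."""
    return [f"D{i}" for i in range(1, DIALECT_COUNTS[lang] + 1)]


# Total number of dialects enumerated in the list above.
total_dialects = sum(DIALECT_COUNTS.values())
```

Summing the counts gives the number of dialects enumerated in the list, which can be checked against the dialect figure quoted in the corpus summary.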
Attribution 4.0 International (CC BY 4.0)