
VAANI is a multimodal, multilingual dataset designed to represent the rich linguistic diversity of India. It currently includes data from two phases, Phase 1 (80 districts) and Phase 2 (40 districts), totaling ~21,500 hours of spontaneous, image-prompted speech collected from more than 110K speakers across 120 districts, describing 210K images in 86 languages. Of this, 835 hours of transcribed audio are available, distributed nearly evenly across all 120 districts.
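As a rough back-of-the-envelope reading of the figures above, the following sketch computes per-district averages and the transcribed share of the audio. It assumes a perfectly even split across districts, which the description only states approximately ("nearly evenly"), so the results are indicative rather than official statistics. The constant and function names are illustrative.

```python
# Figures quoted in the dataset summary above.
TOTAL_HOURS = 21_500      # approximate total hours of speech
TRANSCRIBED_HOURS = 835   # hours with transcriptions
DISTRICTS = 120           # 80 (Phase 1) + 40 (Phase 2)


def per_district(hours: float, districts: int = DISTRICTS) -> float:
    """Average hours per district, ASSUMING a perfectly even split."""
    return hours / districts


# Fraction of all collected audio that has a transcription.
transcribed_share = TRANSCRIBED_HOURS / TOTAL_HOURS

print(f"Transcribed hours per district: ~{per_district(TRANSCRIBED_HOURS):.1f}")
print(f"Total hours per district: ~{per_district(TOTAL_HOURS):.0f}")
print(f"Transcribed share of all audio: ~{transcribed_share:.1%}")
```

Under that even-split assumption, each district contributes roughly 7 hours of transcribed audio out of about 179 hours in total, and a little under 4% of all collected audio is transcribed.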
The VAANI dataset captures spontaneous speech elicited via image prompts, encouraging rich and natural linguistic expression across diverse Indian languages. It is a high-quality, multimodal, and multilingual resource, curated from 120 districts across 22 Indian states.

Each data point in VAANI comprises:
- An utterance from a native speaker
- The corresponding image prompt that inspired the utterance
- A manually annotated transcription (available for a subset of the data)

The dataset has undergone multiple rounds of quality evaluation to ensure reliability and usability for research and deployment.

VAANI includes speech data in 86 languages, many of which are low-resource and rarely represented in existing public datasets:

Agariya, Angami, Angika, Ao, Assamese, Awadhi, Bagheli, Bagri, Bajjika, Bearybashe, Bengali, Bhatri, Bhili, Bhojpuri, Bihari, Bundeli, Chakhesang, Chakma, Chhattisgarhi, Dorli, Duruwa, English, Galo, Garhwali, Garo, Gondi, Gujarati, Hajong, Halbi, Harauti, Hindi, Jaipuri, Kannada, Khandeshi, Khariboli, Khorth, KhorthKhotta, Khortha, Kokborok, Konkani, Kumaoni, Kurmali, Kurukh, Lambani, Lotha, Magadhi, MagadhiMagahi, Magahi, Maithili, Malayalam, Malvani, Malvi, Marathi, Marwadi, Marwari, Meitei, Mewari, Mewati, Nagamese, Nepali, Nimadi, NissiDafla, Nyishi, Odia, Oriya, Punjabi, Rajasthani, Rajbanshi, Rengma, Sadri, Sangtam, Santali, Shekhawati, Sindhi, Sumi, Surgujia, Surjapuri, Tagin, Tamil, Telugu, Tenyidie, Thethi, Tulu, Urdu, Wagdi, Wancho.

This dataset is designed to support a wide range of speech and language technology applications, including:
- Automatic Speech Recognition (ASR) and Text-to-Speech (TTS)
- Foundational speech models for Indian languages
- Speaker identification and verification
- Language identification systems
- Speech enhancement and denoising
- Multimodal Large Language Models (LLMs)
- Benchmarking and evaluation for Indic language technologies
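The three-part data point described above (utterance, image prompt, optional transcription) could be modeled along the following lines. This is a minimal sketch and not the official VAANI schema: the class name, field names, file paths, and the per-record language and district labels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VaaniRecord:
    """Hypothetical view of one data point; NOT the official schema."""
    audio_path: str                       # utterance from a native speaker
    image_path: str                       # image prompt that elicited the utterance
    language: str                         # assumed per-record label, e.g. "Hindi"
    district: str                         # assumed per-record label
    transcription: Optional[str] = None   # present only for a subset of the data

    @property
    def is_transcribed(self) -> bool:
        # Treat missing or whitespace-only text as "no transcription".
        return bool(self.transcription and self.transcription.strip())


def transcribed_only(records: List[VaaniRecord]) -> List[VaaniRecord]:
    """Keep only records that carry a manual transcription."""
    return [r for r in records if r.is_transcribed]


# Placeholder records; paths, district name, and text are made up.
sample = [
    VaaniRecord("audio/0001.wav", "images/0001.jpg", "Hindi",
                "ExampleDistrict", "example transcription"),
    VaaniRecord("audio/0002.wav", "images/0002.jpg", "Tulu",
                "ExampleDistrict"),
]
print(len(transcribed_only(sample)))
```

Keeping the transcription optional mirrors the fact that only a subset of the audio is transcribed: supervised tasks such as ASR would filter with `transcribed_only`, while tasks such as language identification could use every record.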
Attribution 4.0 International (CC BY 4.0)
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by the National e-Governance Division for the AIKosh mission.