Government Of India
VAANI: Multi-modal, Multi-lingual Dataset

VAANI is a multi-modal, multi-lingual dataset designed to represent the rich linguistic diversity of India. It currently includes data from two phases—Phase 1 (80 districts) and Phase 2 (40 districts)—spanning a total of ~21,500 hours of spontaneous, image-prompted speech collected from more than 110K speakers across 120 districts, describing 210K images in 86 languages. From this, 835 hours of transcribed audio data is available, distributed nearly evenly across all 120 districts.
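The headline figures above imply a few derived ratios that are useful when planning experiments. The sketch below only restates the arithmetic; the constant and function names are illustrative, and because the totals are quoted approximately ("~21,500 hours", "more than 110K speakers") the results should be read as rough estimates.

```python
# Figures quoted in the dataset summary. The hour and speaker totals are
# approximate in the source, so every derived value is an estimate.
TOTAL_HOURS = 21_500          # "~21,500 hours" of image-prompted speech
TRANSCRIBED_HOURS = 835       # hours with manual transcription
SPEAKERS = 110_000            # "more than 110K speakers" (a lower bound)
DISTRICTS = {"phase_1": 80, "phase_2": 40}


def summary_stats():
    """Return derived ratios computed from the quoted headline figures."""
    total_districts = sum(DISTRICTS.values())
    return {
        "districts": total_districts,
        "transcribed_share_pct": 100 * TRANSCRIBED_HOURS / TOTAL_HOURS,
        "transcribed_hours_per_district": TRANSCRIBED_HOURS / total_districts,
        # An upper bound, since the speaker count is quoted as a minimum.
        "audio_minutes_per_speaker": 60 * TOTAL_HOURS / SPEAKERS,
    }


for name, value in summary_stats().items():
    print(f"{name}: {value:.2f}")
```

On these figures, roughly 3.9% of the audio is transcribed, which works out to about 7 transcribed hours per district and a little under 12 minutes of audio per speaker on average.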

About Dataset

The VAANI dataset captures spontaneous speech elicited via image prompts, encouraging rich and natural linguistic expression across diverse Indian languages. It is a high-quality, multimodal, and multilingual resource, curated from 120 districts across 22 Indian states.

Each data point in VAANI comprises:

  • An utterance from a native speaker
  • The corresponding image prompt that inspired the utterance
  • A manually annotated transcription (available for a subset of the data)

The dataset has undergone multiple rounds of quality evaluation to ensure reliability and usability for research and deployment.

VAANI includes speech data in 86 languages, many of which are low-resource and rarely represented in existing public datasets: Agariya, Angami, Angika, Ao, Assamese, Awadhi, Bagheli, Bagri, Bajjika, Bearybashe, Bengali, Bhatri, Bhili, Bhojpuri, Bihari, Bundeli, Chakhesang, Chakma, Chhattisgarhi, Dorli, Duruwa, English, Galo, Garhwali, Garo, Gondi, Gujarati, Hajong, Halbi, Harauti, Hindi, Jaipuri, Kannada, Khandeshi, Khariboli, Khorth, KhorthKhotta, Khortha, Kokborok, Konkani, Kumaoni, Kurmali, Kurukh, Lambani, Lotha, Magadhi, MagadhiMagahi, Magahi, Maithili, Malayalam, Malvani, Malvi, Marathi, Marwadi, Marwari, Meitei, Mewari, Mewati, Nagamese, Nepali, Nimadi, NissiDafla, Nyishi, Odia, Oriya, Punjabi, Rajasthani, Rajbanshi, Rengma, Sadri, Sangtam, Santali, Shekhawati, Sindhi, Sumi, Surgujia, Surjapuri, Tagin, Tamil, Telugu, Tenyidie, Thethi, Tulu, Urdu, Wagdi, Wancho.

This dataset is designed to support a wide range of speech and language technology applications, including:

  • Automatic Speech Recognition (ASR) and Text-to-Speech (TTS)
  • Foundational speech models for Indian languages
  • Speaker identification and verification
  • Language identification systems
  • Speech enhancement and denoising
  • Multimodal Large Language Models (LLMs)
  • Benchmarking and evaluation for Indic language technologies
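The structure of a data point described above (utterance, image prompt, optional transcription) can be pictured as a small record type. This is a minimal sketch and not the dataset's actual schema: the class name, field names, and example paths are all hypothetical, and the language and district fields are assumed metadata that the description only implies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VaaniRecord:
    """Hypothetical in-memory view of one data point; names are illustrative."""
    audio_path: str                   # the spoken utterance
    image_path: str                   # the image prompt that elicited it
    language: str                     # assumed metadata field
    district: str                     # assumed metadata field
    transcript: Optional[str] = None  # present only for a subset of the data

    @property
    def is_transcribed(self) -> bool:
        # Treat a missing or whitespace-only transcript as untranscribed.
        return bool(self.transcript and self.transcript.strip())


def transcribed_only(records):
    """Keep the records usable for supervised ASR training."""
    return [r for r in records if r.is_transcribed]


# Placeholder values for illustration only; these are not real dataset entries.
records = [
    VaaniRecord("audio/a.wav", "images/1.jpg", "Nagamese", "district_a",
                transcript="example transcript"),
    VaaniRecord("audio/b.wav", "images/2.jpg", "Hindi", "district_b"),
]
print(len(transcribed_only(records)))
```

Keeping the transcript optional reflects the fact that only a subset of the recordings is transcribed, so untranscribed records stay available for tasks such as language identification or self-supervised pretraining.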

Activity Overview

  • Downloads 1
  • Redirect 86
  • Views 659
  • File Size 0

Tags

  • Bengali
  • Gujarati
  • Kannada
  • Nepali
  • Punjabi
  • Telugu
  • Urdu
  • Sindhi
  • English
  • Tamil
  • Low-Resource Languages
  • Marathi
  • Malayalam
  • Odia
  • speech transcription
  • spontaneous speech
  • Sangtam
  • Ao
  • Halbi
  • Malvi
  • Sadri
  • Chhattisgarhi
  • Bhili
  • Malvani
  • multilingual corpus
  • Sumi
  • Bagheli
  • Khorth
  • Nyishi
  • multimodal dataset
  • Shekhawati
  • Bagri
  • Mewati
  • Meitei
  • 86 Indian languages
  • Surgujia
  • audio-visual dataset
  • Garo
  • MagadhiMagahi
  • TTS training
  • Wancho
  • Awadhi
  • Galo
  • Oriya
  • speech + image + text
  • Tenyidie
  • regional dialects
  • Rajbanshi
  • Nagamese
  • manual transcription
  • Thethi
  • LLM speech integration
  • KhorthKhotta
  • Konkani
  • quality evaluated
  • speaker identification
  • diverse demographics
  • telemedicine AI applications
  • Marwadi
  • speaker diversity
  • Tulu
  • Assamese
  • Harauti
  • Rajasthani
  • language identification
  • 120+ districts
  • Tagin
  • Marwari
  • Gondi
  • Bajjika
  • 22 Indian states
  • Surjapuri
  • dialect diversity
  • Kokborok
  • image-prompted data
  • Santali
  • speech enhancement
  • Khariboli
  • Rengma
  • geo-centric data collection
  • Hajong
  • Hindi
  • ASR training
  • Wagdi
  • Bhatri
  • Dorli
  • multi-modal language resources
  • real-life recording environments
  • Mewari
  • linguistic diversity
  • Nimadi
  • Khandeshi
  • data for conversational AI
  • benchmarking dataset
  • Lotha
  • Kurmali
  • Bhojpuri
  • NissiDafla
  • Angika
  • Lambani
  • Magahi
  • Jaipuri
  • Bihari
  • Chakhesang
  • Magadhi
  • Angami
  • Chakma
  • Duruwa
  • Bearybashe
  • Khortha
  • Kumaoni
  • Kurukh
  • Garhwali
  • Maithili
  • Agariya
  • Bundeli
  • large-scale speech corpus

License Control

Attribution 4.0 International (CC BY 4.0)

Related Models

Nagamese Speech-to-Text
Automatic Speech Recognition (ASR) model for Nagamese speech, designed to transcribe spoken Nagamese into text for real-world usage.
Automatic Speech Recognition
whisper
Nagamese
low-resource-language
Speech Recognition
ASR
  • Upvoters 0
  • Downloads 0
  • File Size 0
  • Views 3
Updated 30 day(s) ago

MWIRE LABS