Nikhil Narasimhan

ORGANISATION

IndicVoices

Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

About Dataset

INDICVOICES is a dataset of natural and spontaneous speech containing a total of 23.7K hours of read (8%), extempore (76%) and conversational (15%) audio from 51K speakers covering 400+ Indian districts and 22 languages. See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/IndicVoices.
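As a rough illustration of what those shares mean in absolute terms, the sketch below converts the quoted percentages into approximate hours per speech style. It uses only the figures in the description above; the function name `approx_hours` is a hypothetical helper. The published percentages are rounded, so the results are estimates rather than official per-style totals.

```python
# Figures quoted in the dataset description; everything else is illustrative.
TOTAL_HOURS = 23_700  # "23.7K hours"
SHARES_PERCENT = {"read": 8, "extempore": 76, "conversational": 15}


def approx_hours(total_hours, shares_percent):
    """Estimate hours per speech style from rounded percentage shares."""
    return {
        style: total_hours * pct / 100
        for style, pct in shares_percent.items()
    }


hours = approx_hours(TOTAL_HOURS, SHARES_PERCENT)
for style, value in hours.items():
    print(f"{style}: ~{value:,.0f} hours")

# The rounded shares need not add up to exactly 100.
print("sum of shares:", sum(SHARES_PERCENT.values()), "%")
```

The shares as quoted sum to 99%, which is consistent with rounding in the published description, so the per-style estimates do not add up to the full 23.7K hours.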

Purpose of Dataset

To Build Robust Speech Interfaces

Dataset Metadata

License

Attribution 4.0 International (CC BY 4.0)

Geographical coverage

India

Sector

Science, Technology and Research

Author

ai4bharat

Source Organisation

Uploaded by

Data Quality Score (Beta)

-

Dataset type

Unstructured

Frequency

NA

Time Granularity

NA

Year range

N.A.

Date & Time

20/05/25 08:10:06

Visibility

Open

Hosted / Redirected

Hosted

Data Type

Primary

Tags

Speech Dataset

License Control

Attribution 4.0 International (CC BY 4.0)

© 2026 - Copyright AIKosh. All rights reserved.