ORGANISATION

IndicSynth

IndicSynth is a large-scale multilingual synthetic speech dataset for advancing audio deepfake detection and anti-spoofing research. It covers 12 Indian languages and contains over 4,000 hours of synthetic speech. Its metadata includes speaker IDs, gender information, bona fide source–target references, and transcripts for TTS-generated samples, which supports cross-lingual and bias analysis research. Developed at SBILab, IIIT-Delhi. Recognized with the Outstanding Paper Award at ACL 2025.

About Dataset

IndicSynth is a large-scale multilingual synthetic speech dataset designed to advance multilingual audio deepfake detection (ADD) and anti-spoofing research, developed at SBILab, IIIT-Delhi. It contains over 4,000 hours of synthetic speech across 12 low-resource Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, and Urdu.

The dataset contains rich metadata, including the synthetic speech generative model, speaker IDs, gender, transcript (if applicable), and file path to the synthetic audio. The bona fide source and target speech samples referenced in the IndicSynth metadata are drawn from the IndicSUPERB dataset.

The transcripts included in the metadata.csv files are the intended text prompts used during synthetic speech generation via TTS models. We provide these transcripts to enable future explorations, but do not guarantee perfect alignment with the generated audio. If you intend to use IndicSynth for speech-to-text or similar tasks, we strongly recommend conducting careful human evaluation with proficient native speakers of the respective languages.

This work was recognized with the Outstanding Paper Award at ACL 2025.

If you use IndicSynth, please cite the following papers:

1.) Divya V Sharma, Vijval Ekbote, and Anubha Gupta. 2025. IndicSynth: A Large-Scale Multilingual Synthetic Speech Dataset for Low-Resource Indian Languages. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22037–22060, Vienna, Austria. Association for Computational Linguistics.

2.) Tahir Javed, Kaushal Bhogale, Abhigyan Raman, Pratyush Kumar, Anoop Kunchukuttan, and Mitesh Khapra. 2023. IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian Languages. Proceedings of the AAAI Conference on Artificial Intelligence, 37:12942–12950.
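As a rough illustration of how the metadata fields described above might be used for bias analysis, the sketch below reads a metadata table and counts samples per gender and per generative model. The column names (model, speaker_id, gender, transcript, file_path) and the three sample rows are made-up placeholders, not the dataset's actual schema or contents; check the header of the real metadata.csv before adapting it.

```python
import csv
import io
from collections import Counter

# ASSUMPTION: these column names and rows are hypothetical placeholders.
# The real metadata.csv header may differ; inspect it first.
SAMPLE_METADATA = """model,speaker_id,gender,transcript,file_path
model_a,spk001,female,example prompt one,audio/0001.wav
model_b,spk002,male,,audio/0002.wav
model_a,spk003,male,example prompt two,audio/0003.wav
"""


def load_rows(handle):
    """Read every row of a metadata CSV into a list of dicts."""
    return list(csv.DictReader(handle))


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


def rows_with_transcript(rows):
    """Keep only rows that carry a non-empty transcript prompt."""
    return [row for row in rows if row["transcript"].strip()]


# For a real file, replace io.StringIO(...) with
# open("metadata.csv", newline="", encoding="utf-8").
rows = load_rows(io.StringIO(SAMPLE_METADATA))
gender_counts = count_by(rows, "gender")
model_counts = count_by(rows, "model")
transcribed = rows_with_transcript(rows)

print(dict(gender_counts))
print(dict(model_counts))
print(len(transcribed), "rows have a transcript prompt")
```

Counting per group before training or evaluation is one simple way to spot imbalance across gender or generative model. Because the transcripts are intended prompts rather than verified transcriptions, rows selected this way would still need the human evaluation recommended above before use in speech-to-text work.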

Purpose of Dataset

Multilingual audio deepfake detection (ADD) research or mitigating linguistic biases in audio deepfake detection systems. Enhancing the robustness of speaker verification (SV) systems against spoofing (impersonation) attacks and developing anti-spoofing solutions. Cross-lingual or gender bias studies in speech synthesis and recognition systems.