IndicST - Indian Multilingual Speech Translation Corpus

IndicST is a speech translation dataset for training and evaluating Speech LLMs on Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) tasks. It includes 10.8k hours of training data and 1.13k hours of evaluation data across multiple Indic languages.

About Dataset

IndicST is a large-scale Indian multilingual speech corpus tailored for speech-to-text and speech-to-speech translation tasks. It is designed for training and evaluating Speech Large Language Models (LLMs) on Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST). Its synthetic data is curated, automatically generated, and manually verified.

The training set consists of 10.8k hours of ASR data drawn from 14 open-source datasets covering nine Indic languages. The translation data is generated using IndicTrans2 and supports two translation modes: one-to-many (English → Indic) and many-to-one (Indic → English).

Test sets include both speech-audio-based translations (Kathbath ASR dataset) and text-based translations (AI4Bharat Conv text dataset). The dataset is benchmarked on ASR and AST tasks using Whisper + LLaMA-based LLMs, with BLEU and chrF++ scores reported for in-domain and out-of-domain test cases.

IndicST is a valuable resource for low-resource ASR, real-time multilingual communication, and speech AI development.
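To make the two translation modes concrete, the following is a minimal sketch of how the (source, target) language pairs could be enumerated. The function name `translation_directions`, the `pivot` parameter, and the language codes in the example are illustrative assumptions; they are not part of the dataset's tooling, and the codes shown are not the dataset's official list of nine languages.

```python
from typing import Dict, List, Tuple


def translation_directions(
    indic_langs: List[str], pivot: str = "en"
) -> Dict[str, List[Tuple[str, str]]]:
    """Enumerate (source, target) pairs for the two translation modes.

    Hypothetical helper: one-to-many pairs go from the pivot (English)
    to each Indic language, many-to-one pairs go the other way.
    """
    one_to_many = [(pivot, lang) for lang in indic_langs]
    many_to_one = [(lang, pivot) for lang in indic_langs]
    return {"one_to_many": one_to_many, "many_to_one": many_to_one}


# Illustrative language codes only (assumed, not taken from the dataset card).
directions = translation_directions(["hi", "ta", "bn"])
for mode, pairs in directions.items():
    print(mode, pairs)
```

With N Indic languages this yields N directions per mode, so each mode covers every language exactly once against English.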