BhasaAnuvaad

BhasaAnuvaad is the largest Indic-language Automatic Speech Translation (AST) dataset, containing over 44,400 hours of speech and 17 million text segments across 13 Indian languages and English. It is designed to facilitate research in speech-to-text translation and multilingual AI.

About Dataset

BhasaAnuvaad is a multilingual speech translation dataset covering 13 Indian languages and English. Its 44,400+ hours of speech and 17 million aligned text segments make it the largest Indic-language AST resource available. The data is sourced from the Spoken-Tutorial YouTube channel and provides parallel speech-to-text translation pairs. It is intended to advance research in automatic speech recognition (ASR), machine translation (MT), and cross-lingual AI applications, and it can be used to build and evaluate speech translation models in both directions: Indic-to-English and English-to-Indic. BhasaAnuvaad is publicly available under the CC-BY-4.0 license, which permits academic and industrial use with attribution.
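
As an illustration of the two translation directions mentioned above, the following is a minimal sketch of how segments might be grouped by direction before training or evaluation. The record layout (`Segment`, `audio_path`, `src_lang`, `tgt_lang`, `text`) and the language codes are assumptions made for this example; the actual BhasaAnuvaad schema may differ.

```python
from dataclasses import dataclass


# Hypothetical record layout; the real BhasaAnuvaad schema may differ.
@dataclass
class Segment:
    audio_path: str  # path to the speech clip
    src_lang: str    # language spoken in the audio, e.g. "hi" or "en"
    tgt_lang: str    # language of the translated text
    text: str        # translation in the target language


def direction(seg: Segment) -> str:
    """Classify a segment as Indic-to-English, English-to-Indic, or other."""
    if seg.src_lang != "en" and seg.tgt_lang == "en":
        return "indic-to-en"
    if seg.src_lang == "en" and seg.tgt_lang != "en":
        return "en-to-indic"
    return "other"


def split_by_direction(segments):
    """Group segments into one list per translation direction."""
    groups = {"indic-to-en": [], "en-to-indic": [], "other": []}
    for seg in segments:
        groups[direction(seg)].append(seg)
    return groups


# Toy records for illustration only; these are not taken from the dataset.
samples = [
    Segment("clip_001.wav", "hi", "en", "Welcome to this tutorial."),
    Segment("clip_002.wav", "en", "ta", "(Tamil translation)"),
    Segment("clip_003.wav", "bn", "en", "Open the terminal."),
]

groups = split_by_direction(samples)
for name, items in groups.items():
    print(name, len(items))
```

Keeping the direction as a derived label, rather than storing it on each record, means the same records can be regrouped if more language pairs are added later.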