The "aya_dataset" subset of the "bhasha-sft" dataset is intended for training and fine-tuning language models for Indic languages.
This subset is a linguistic resource for machine learning and natural language processing (NLP) work on Indic languages. It is sourced from the Bhasha SFT (supervised fine-tuning) collection and contains a diverse range of conversational text data, which serves as training material for fine-tuning language models.
Note on Encoding:
This dataset is encoded in UTF-8.
Windows users:
To ensure non-ASCII characters display correctly in Excel, first download the .csv file, open it in Notepad, choose File → Save As, and select "UTF-8 with BOM" as the encoding. Then open the saved file in Excel.
macOS users:
You can open the CSV file directly in Excel or any spreadsheet software without any issues.
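As an alternative to the manual Notepad step, the re-save can be scripted. The following is a minimal sketch, assuming the downloaded file is valid UTF-8; the function name save_with_bom and the file names in the usage comment are hypothetical and not part of the dataset.

```python
from pathlib import Path

# Byte sequence that Excel on Windows uses to detect UTF-8.
UTF8_BOM = b"\xef\xbb\xbf"


def save_with_bom(src: str, dst: str) -> None:
    """Copy a UTF-8 CSV file to dst, prepending a BOM if it has none."""
    data = Path(src).read_bytes()
    # Fail early if the input is not valid UTF-8 (raises UnicodeDecodeError).
    data.decode("utf-8")
    if not data.startswith(UTF8_BOM):
        data = UTF8_BOM + data
    Path(dst).write_bytes(data)


# Hypothetical usage; replace the names with the downloaded file's path:
# save_with_bom("aya_dataset.csv", "aya_dataset_excel.csv")
```

Working on raw bytes leaves line endings and cell contents untouched; only the three-byte marker is added, and a file that already has it is copied unchanged.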
Attribution 4.0 International (CC BY 4.0)
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by National e-Governance Division for AIKosh mission.