bhasha-sft_aya_dataset

The "aya_dataset" subset of the "bhasha-sft" dataset is designed for supervised fine-tuning (SFT) of language models for Indic languages.

About Dataset

The "aya_dataset" subset is a text resource intended to support machine learning and natural language processing (NLP) work on Indic languages. It is sourced from the Bhasha SFT (supervised fine-tuning) project and contains a diverse range of conversational text data that can serve as training material for fine-tuning language models.

Note on Encoding:
This dataset is encoded in UTF-8 format.

Windows users:
Excel may not display non-ASCII characters correctly when it opens a UTF-8 .csv file that has no byte order mark (BOM). To fix this, download the .csv file, open it in Notepad, choose File → Save As, and select "UTF-8 with BOM" as the encoding. Then open the saved file in Excel.
macOS users:
You can open the CSV file directly in Excel or any other spreadsheet software without extra steps.
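
The Notepad re-save step for Windows can also be scripted. The following is a minimal Python sketch, assuming the downloaded file is valid UTF-8; the function name add_utf8_bom and the file names in the usage note are hypothetical and not part of the dataset. It works on raw bytes, so line endings and cell contents are left untouched.

```python
import codecs
from pathlib import Path


def add_utf8_bom(src, dst):
    """Copy a UTF-8 CSV file to dst with a UTF-8 byte order mark prepended."""
    data = Path(src).read_bytes()

    # Avoid writing a second BOM if the file already has one.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]

    # Sanity check: raises UnicodeDecodeError if the input is not valid UTF-8.
    data.decode("utf-8")

    Path(dst).write_bytes(codecs.BOM_UTF8 + data)
```

For example, add_utf8_bom("aya_dataset.csv", "aya_dataset_excel.csv") would write a copy that Excel on Windows should recognise as UTF-8. Both file names are placeholders for wherever the downloaded file is stored.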