
Hindi is written in the Devanagari script and is the most widely spoken language in India, primarily used across the northern and central regions. Its inclusion in the soketlabs/bhasha-wiki dataset ensures that language models can learn from and cater to a broad user base, covering general knowledge, government communication, education, and popular media. With its deep cultural significance and widespread usage, Hindi is a cornerstone for building Indic-aware AI systems.
Note on Encoding:
This dataset is encoded in UTF-8 format.
Windows users:
To display non-ASCII characters correctly in Excel, first download the .csv file, open it in Notepad, choose File → Save As, and select "UTF-8 with BOM" as the encoding. Then open the saved file in Excel.
macOS users:
You can open the CSV file directly in Excel or any spreadsheet software without any issues.
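For Windows users who prefer a script to the Notepad steps, the same conversion can be sketched in Python. This is a minimal sketch, not part of the dataset's tooling: the function name add_utf8_bom and the file names in the usage note are hypothetical, and it assumes the downloaded file is valid UTF-8 as stated above.

```python
import codecs
from pathlib import Path


def add_utf8_bom(src: str, dst: str) -> None:
    """Copy a UTF-8 file to dst, prefixed with a UTF-8 byte order mark (BOM)."""
    data = Path(src).read_bytes()

    # Drop an existing BOM so the output never ends up with two.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]

    # Fail early (UnicodeDecodeError) if the input is not valid UTF-8.
    data.decode("utf-8")

    # Write the BOM followed by the unchanged bytes; line endings are preserved.
    Path(dst).write_bytes(codecs.BOM_UTF8 + data)
```

Calling add_utf8_bom("hindi.csv", "hindi_bom.csv") (placeholder file names) writes a copy that Excel on Windows should detect as UTF-8. Working on raw bytes rather than decoded text leaves the CSV content, including any line breaks inside quoted fields, untouched.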
Attribution 4.0 International (CC BY 4.0)