
English provides a rich and diverse base of Wikipedia content
English, written in the Latin script, serves as the source language for all translations in the soketlabs/bhasha-wiki dataset. As a global lingua franca and the dominant language of science, technology, and academia, English provides a rich and diverse base of Wikipedia content. Its inclusion ensures a strong foundation for multilingual training, allowing models to capture both general world knowledge and linguistic patterns that can be effectively transferred to Indic languages during translation and learning. English acts as the anchor for alignment, cross-lingual understanding, and transfer learning across the dataset.
Note on Encoding:
This dataset is encoded in UTF-8 format.
Windows users:
To ensure proper display of non-ASCII characters in Excel, first download the .csv file, open it in Notepad, choose File → Save As, and select UTF-8 with BOM in the Encoding dropdown. Then open the saved file in Excel.
macOS users:
You can open the CSV file directly in Excel or any spreadsheet software without any issues.
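Scripted alternative (illustrative):
For many files, the Notepad step can be approximated with a short Python script. This is a minimal sketch, not part of the dataset tooling: the function name add_utf8_bom and the file names in the usage note are hypothetical, and it assumes the downloaded file is valid UTF-8, as stated above.

```python
import codecs
from pathlib import Path


def add_utf8_bom(src: str, dst: str) -> None:
    """Copy a UTF-8 file to dst, prepending a byte order mark if it is missing.

    Hypothetical helper: it mirrors the manual "Save As -> UTF-8 with BOM"
    step so that Excel on Windows detects the encoding.
    """
    data = Path(src).read_bytes()

    # Add the BOM (bytes EF BB BF) only if the file does not already start with it.
    if not data.startswith(codecs.BOM_UTF8):
        data = codecs.BOM_UTF8 + data

    # Fail early if the content is not valid UTF-8;
    # "utf-8-sig" skips the leading BOM while decoding.
    data.decode("utf-8-sig")

    Path(dst).write_bytes(data)
```

For example, add_utf8_bom("english.csv", "english_bom.csv") would write a copy that Excel should open with the correct encoding; both file names are placeholders. Working on raw bytes leaves line endings and quoted fields in the CSV untouched.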
Attribution 4.0 International (CC BY 4.0)
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by the National e-Governance Division for the AIKosh mission.