bhasha-wiki-Urdu

About Dataset

Urdu uses the Perso-Arabic script and is spoken in parts of North India and widely in Pakistan, serving as a bridge between Indo-Aryan linguistic roots and Islamic cultural traditions. Its inclusion in the soketlabs/bhasha-wiki dataset enhances script diversity and introduces stylistic and cultural nuances. Urdu contributes valuable content in poetry, media, and historical discourse, making it an important language for building culturally sensitive and versatile AI systems.

Note on Encoding:
This dataset is encoded in UTF-8.

Windows users:
To display non-ASCII characters correctly in Excel, first download the .csv file, open it in Notepad, choose File → Save As, and select "UTF-8 with BOM" from the Encoding dropdown. Then open the saved file in Excel.

macOS users:
You can open the CSV file directly in Excel or any other spreadsheet software without any issues.
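
Scripted alternative:
The Notepad re-save described for Windows can also be done with a short script. The following is a minimal sketch, not part of the dataset itself: the function name add_utf8_bom and the file names in the usage note are hypothetical, and it assumes the downloaded file is valid UTF-8.

```python
import codecs
from pathlib import Path


def add_utf8_bom(src, dst):
    """Copy a UTF-8 CSV from src to dst so that dst starts with exactly one BOM."""
    data = Path(src).read_bytes()
    # Drop an existing BOM so the output never carries two.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Fail early (UnicodeDecodeError) if the input is not valid UTF-8.
    data.decode("utf-8")
    # Work on raw bytes so line endings and Urdu text are left untouched.
    Path(dst).write_bytes(codecs.BOM_UTF8 + data)
```

For example, add_utf8_bom("urdu.csv", "urdu_bom.csv") would write a copy that Excel should detect as UTF-8; replace both names with the actual paths of the downloaded file and the desired output.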