
A multilingual text corpus of sentences in Hindi, Awadhi, Bhojpuri, Braj, and Magahi, created for language identification and dialect classification research on Indo-Aryan languages.
This dataset is a multilingual corpus developed for the Indo-Aryan Language Identification (ILI) Shared Task, organized as part of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) at COLING 2018. The task aimed at identifying five closely related languages of the Indo-Aryan language family: Hindi (also known as Khari Boli), Braj Bhasha, Awadhi, Bhojpuri, and Magahi. These languages form part of a continuum that runs from Western Uttar Pradesh (Hindi and Braj Bhasha) through Eastern Uttar Pradesh (Awadhi and Bhojpuri) to the neighbouring eastern state of Bihar (Bhojpuri and Magahi). The dataset contains 15,000 sentences in each of the five languages and is designed to support automatic language identification and classification tasks.

Citation: Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi, and Ahmed Ali. 2018. Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018). Association for Computational Linguistics, Santa Fe, New Mexico, USA.

This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative, led by CivicDataLab (CDL) in partnership with the Gates Foundation and in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation, and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).
The purpose of this dataset is to support automatic language identification and text classification research for closely related Indo-Aryan languages. It can be used for language identification systems, multilingual NLP, dialect classification, language modeling, benchmark evaluation of machine learning systems, linguistic analysis, low-resource language technology development, and computational research involving Hindi, Awadhi, Bhojpuri, Braj, and Magahi.
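As an illustration of the language identification use case, the following is a minimal sketch of a character n-gram classifier (multinomial Naive Bayes with add-one smoothing) written against the Python standard library only. It is not the method used in the shared task or by the dataset authors. The class name `NgramLanguageIdentifier`, the helper `char_ngrams`, and the assumption that the corpus has already been loaded as parallel lists of sentences and language labels are all hypothetical; the strings in the usage lines are placeholders, not corpus data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padded with one space on each side."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Hypothetical sketch: multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of training sentences
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, sentences, labels):
        for text, label in zip(sentences, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        grams = char_ngrams(text, self.n)
        best_label, best_score = None, -math.inf
        for label, gram_counts in self.counts.items():
            # Add-one smoothing: every n-gram gets a pseudo-count of 1.
            denom = sum(gram_counts.values()) + len(self.vocab)
            score = math.log(self.docs[label] / total_docs)  # log prior
            for gram in grams:
                score += math.log((gram_counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Placeholder strings and labels for illustration only (not corpus data).
model = NgramLanguageIdentifier(n=2).fit(["aaaa aaab", "zzzz zzzy"], ["A", "B"])
predicted = model.predict("aaa")
```

In practice the training lists would hold the corpus sentences with their five language labels, and a held-out split would be needed for any benchmark evaluation. Character n-grams are chosen here because they need no tokenizer and work directly on Devanagari text, but this is one simple baseline among many possible approaches.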
Apache 2.0