Indo-Aryan Language Identification Shared Task Dataset

A multilingual text corpus containing sentences in Hindi, Awadhi, Bhojpuri, Braj, and Magahi created for language identification and dialect classification research in Indo-Aryan languages.

About Dataset

This dataset is a multilingual corpus developed for the Indo-Aryan Language Identification (ILI) Shared Task, organized as part of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) at COLING 2018. The task was aimed at identifying five closely related languages of the Indo-Aryan language family: Hindi (also known as Khari Boli), Braj Bhasha, Awadhi, Bhojpuri and Magahi. These languages form part of a continuum running from Western Uttar Pradesh (Hindi and Braj Bhasha) through Eastern Uttar Pradesh (Awadhi and Bhojpuri) to the neighbouring eastern state of Bihar (Bhojpuri and Magahi). The dataset contains 15,000 sentences in each of the five languages and is designed to support automatic language identification and classification tasks.

Citation: Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi, and Ahmed Ali. 2018. Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018). Association for Computational Linguistics, Santa Fe, New Mexico, USA.

This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative led by CivicDataLab (CDL), in partnership with the Gates Foundation and in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).

Purpose of Dataset

The purpose of this dataset is to support automatic language identification and text classification research for closely related Indo-Aryan languages. It can be used for language identification systems, multilingual NLP, dialect classification, language modeling, benchmark evaluation of machine learning systems, linguistic analysis, low-resource language technology development, and computational research involving Hindi, Awadhi, Bhojpuri, Braj, and Magahi.
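As an illustration of the language identification use case, the following is a minimal sketch of a character n-gram classifier (multinomial naive Bayes with add-one smoothing), a common baseline for telling closely related languages apart. It is not the method used in the shared task. The class name NgramLanguageIdentifier, the helper char_ngrams, and the labels X and Y are hypothetical, and the training strings are synthetic placeholders, not sentences from the corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the space-padded text."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Hypothetical baseline: naive Bayes over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.totals = Counter()             # label -> total n-grams seen
        self.docs = Counter()               # label -> number of sentences
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, sentences, labels):
        for sentence, label in zip(sentences, labels):
            grams = char_ngrams(sentence, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, sentence):
        grams = char_ngrams(sentence, self.n)
        n_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.docs:
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.docs[label] / n_docs)
            denom = self.totals[label] + vocab_size
            for gram in grams:
                score += math.log((self.counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Synthetic placeholder data (not from the corpus).
train_sentences = ["aaa bbb aaa", "bbb aaa bbb", "ccc ddd ccc", "ddd ccc ddd"]
train_labels = ["X", "X", "Y", "Y"]
model = NgramLanguageIdentifier(n=3).fit(train_sentences, train_labels)
prediction = model.predict("aaa bbb")
```

With real data, the sentences and labels would come from the corpus files, and a held-out split would be needed to report accuracy.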

Dataset Metadata

License

Apache 2.0

Geographical coverage

India

Sector

Science, Technology and Research

Author

Ritesh Kumar, Bornini Lahiri, Mayank Jain

Source Organisation

Digital India BHASHINI Division

Uploaded by

Nikil Augustine

Data Quality Score (Beta)

Dataset type

Structured

Frequency

Time Granularity

Static

Year range

N.A.

Date & Time

01/06/26 06:45:00

Visibility

Open

Hosted / Redirected

Redirected

Data Type

Hybrid

If redirected, which source

GitHub

Data Collection Method

The data was collected from both printed and digital sources. Printed material was obtained from different institutions, libraries, and local literary and cultural groups, and included printed stories, novels and essays in books, magazines and newspapers. The printed material was scanned and then converted to text with OCR, using Google OCR for Hindi, which is part of the Drive API. The data was organized and labeled by language category to facilitate supervised machine learning and benchmark evaluation for language identification and dialect classification tasks.
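Because the data is labeled by language category, a first step in most supervised workflows is to read (sentence, label) pairs and check the label distribution. The sketch below assumes a hypothetical layout of one sentence per line followed by a tab and a label; the actual file layout is not described here and should be checked against the source repository. The function name parse_labeled_lines and the labels LANG_A and LANG_B are placeholders.

```python
from collections import Counter


def parse_labeled_lines(lines, sep="\t"):
    """Split each non-empty line into (sentence, label) at the last separator.

    The tab-separated layout is an assumption, not a documented format.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        sentence, found, label = line.rpartition(sep)
        if not found:
            raise ValueError(f"no separator in line: {line!r}")
        pairs.append((sentence.strip(), label.strip()))
    return pairs


# Placeholder lines standing in for a real file (not from the corpus).
sample = [
    "placeholder sentence one\tLANG_A\n",
    "placeholder sentence two\tLANG_B\n",
    "\n",
    "placeholder sentence three\tLANG_A\n",
]
pairs = parse_labeled_lines(sample)
label_counts = Counter(label for _, label in pairs)
```

Passing an open file object instead of the sample list works the same way, since the function only iterates over lines.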