Indo-Aryan Language Identification Shared Task Dataset

A multilingual text corpus containing sentences in Hindi, Awadhi, Bhojpuri, Braj, and Magahi, created for language identification and dialect classification research in Indo-Aryan languages.

About Dataset

This dataset is a multilingual corpus developed for the Indo-Aryan Language Identification (ILI) Shared Task, organized as part of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) at COLING 2018. The task aimed to identify five closely related languages of the Indo-Aryan language family: Hindi (also known as Khari Boli), Braj Bhasha, Awadhi, Bhojpuri and Magahi. These languages form part of a continuum stretching from Western Uttar Pradesh (Hindi and Braj Bhasha) to Eastern Uttar Pradesh (Awadhi and Bhojpuri) and the neighbouring eastern state of Bihar (Bhojpuri and Magahi). The dataset contains 15,000 sentences in each of the five languages and is designed to support automatic language identification and classification tasks.

Citation: Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi, and Ahmed Ali. 2018. Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018). Association for Computational Linguistics, Santa Fe, New Mexico, USA.

This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative led by CivicDataLab (CDL), in partnership with the Gates Foundation and in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).

Purpose of Dataset

The purpose of this dataset is to support automatic language identification and text classification research for closely related Indo-Aryan languages. It can be used for language identification systems, multilingual NLP, dialect classification, language modeling, benchmark evaluation of machine learning systems, linguistic analysis, low-resource language technology development, and computational research involving Hindi, Awadhi, Bhojpuri, Braj, and Magahi.
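
As a starting point for the language identification use case, the sketch below shows how labelled sentences might be parsed, how the per-language counts could be checked, and how character n-gram features (a common choice for closely related languages) could be extracted. It assumes a file layout of one sentence per line followed by a tab and a label code; that layout, the label codes HIN, BRA, AWA, BHO and MAG, and the function names are assumptions made for illustration and are not stated on this page. The sample rows are placeholders, not sentences from the dataset.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed label codes; this page does not list the actual codes.
LABELS = {"HIN", "BRA", "AWA", "BHO", "MAG"}


def parse_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'sentence<TAB>label' lines, skipping blank or malformed rows."""
    pairs = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank line
        # Split on the last tab so tabs inside the sentence are preserved.
        text, sep, label = line.rpartition("\t")
        if not sep or label.strip() not in LABELS:
            continue  # no tab found, or label outside the assumed set
        pairs.append((text.strip(), label.strip()))
    return pairs


def label_counts(pairs: List[Tuple[str, str]]) -> Counter:
    """Count sentences per label, e.g. to check the class balance."""
    return Counter(label for _, label in pairs)


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, padded with one space per side."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


# Placeholder rows for illustration only; not taken from the dataset.
sample = [
    "sentence one\tHIN\n",
    "sentence two\tBHO\n",
    "\n",
    "no label here\n",
    "sentence three\tHIN\n",
]
pairs = parse_lines(sample)
counts = label_counts(pairs)
features = [char_ngrams(text) for text, _ in pairs]
```

With the real files, `parse_lines` could be given an open file handle (opened with `encoding="utf-8"`), and the n-gram counters could feed any standard classifier. Working at the character level avoids the need for a tokenizer, which is convenient for Devanagari text in low-resource varieties.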

Activity Overview

  • Downloads 0
  • Redirect 1
  • File Size 0
  • Views 8

Tags

  • Indian Languages
  • Machine Learning
  • Indo-Aryan
  • dialect identification
  • language identification
  • benchmark dataset
  • Indian Language
  • multilingual corpus
  • Bhojpuri
  • Awadhi
  • Magahi
  • Indic Languages
  • Bihar
  • Uttar Pradesh
  • Delhi
  • low-resource-language
  • Low Resource NLP
  • Multilingual NLP
  • Non-English NLP
  • Indic NLP
  • Hindi Language
  • Text Classification
  • Natural language processing

License Control

Apache 2.0