COMI-LINGUA

COde-MIxing and LINGuistic Insights on Natural Hinglish Usage and Annotation

About Dataset

COMI-LINGUA (COde-MIxing and LINGuistic Insights on Natural Hinglish Usage and Annotation) is a high-quality Hindi-English code-mixed dataset, manually annotated by three expert annotators. It serves as a multitask benchmark for multilingual NLP models.

COMI-LINGUA provides annotations for several key NLP tasks:

1. Language Identification (LID): Token-wise classification of Hindi, English, and other linguistic units.

Initial predictions were generated using the Microsoft LID tool, which annotators then reviewed and corrected.

  • Example sentence: प्रधानमंत्री नरेन्द्र मोदी डिजिटल इंडिया मिशन को आगे बढ़ाने के लिए पिछले सप्ताह Google के CEO सुंदर पिचाई से मुलाकात की थी ।
  • LID tags: hi hi hi en en en hi hi hi hi hi hi hi en hi en hi hi hi hi hi hi ot
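
The one-tag-per-token alignment in this example can be checked programmatically. The following is a minimal sketch, assuming whitespace tokenization and the three labels shown (hi, en, ot); the helper `align_tokens` is hypothetical and not part of the dataset's tooling.

```python
from collections import Counter


def align_tokens(sentence: str, tags: str) -> list[tuple[str, str]]:
    """Pair each whitespace-separated token with its LID tag."""
    tokens = sentence.split()
    labels = tags.split()
    if len(tokens) != len(labels):
        raise ValueError(f"{len(tokens)} tokens but {len(labels)} tags")
    return list(zip(tokens, labels))


# Example sentence and tags copied from the LID example above.
sentence = (
    "प्रधानमंत्री नरेन्द्र मोदी डिजिटल इंडिया मिशन को आगे बढ़ाने के लिए "
    "पिछले सप्ताह Google के CEO सुंदर पिचाई से मुलाकात की थी ।"
)
tags = "hi hi hi en en en hi hi hi hi hi hi hi en hi en hi hi hi hi hi hi ot"

pairs = align_tokens(sentence, tags)
counts = Counter(tag for _, tag in pairs)  # tokens per language label
print(counts)
```

For this sentence the counts come to 17 hi, 5 en, and 1 ot token, which gives a quick view of how heavily the sentence is code-mixed.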

2. Matrix Language Identification (MLI): Sentence-level annotation of the dominant language.

  • Example sentence: किसानों को अपनी फसल बेचने में दिक्कत न हो इसके लिये Electronic National Agriculture Market यानि ई-नाम योजना तेजी से काम हो रहा है।
  • Matrix Language: hi

3. Part-of-Speech (POS) Tagging: Syntactic categorization for linguistic analysis.

Tags were pre-assigned using the CodeSwitch NLP library, which annotators then reviewed and corrected.

  • Example sentence: भारत द्वारा बनाया गया Unified Payments Interface यानि UPI भारत की एक बहुत बड़ी success story है ।
  • POS tags: PROPN ADP VERB VERB PROPN PROPN PROPN CONJ PROPN PROPN ADP DET ADJ ADJ NOUN NOUN VERB X

4. Named Entity Recognition (NER): Identification of named entities in Hinglish text.

Each token was pre-assigned a tag using the CodeSwitch NLP library; annotators then reviewed and corrected these tags.

  • Example sentence: मालूम हो कि पेरिस स्थित Financial Action Task Force, FATF ने जून 2018 में पाकिस्तान को ग्रे लिस्ट में रखा था।
  • NER tags: "पेरिस": GPE, "Financial Action Task Force, FATF": ORGANISATION, "2018": DATE, "पाकिस्तान": GPE
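
Span-level annotations like these are commonly converted to per-token labels before training a sequence tagger. The sketch below assumes a BIO scheme and a hand-tokenized version of the example sentence; the storage format actually used in the dataset files is not stated here, and `spans_to_bio` is a hypothetical helper.

```python
def spans_to_bio(tokens, spans):
    """Convert (span_tokens, label) pairs to per-token BIO tags.

    Each span is matched to its first still-unlabelled occurrence,
    scanning the sentence from left to right.
    """
    tags = ["O"] * len(tokens)
    for span_tokens, label in spans:
        n = len(span_tokens)
        for i in range(len(tokens) - n + 1):
            window_free = all(t == "O" for t in tags[i:i + n])
            if tokens[i:i + n] == span_tokens and window_free:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                break
    return tags


# Hand-tokenized example sentence (punctuation split off as separate tokens).
tokens = [
    "मालूम", "हो", "कि", "पेरिस", "स्थित", "Financial", "Action", "Task",
    "Force", ",", "FATF", "ने", "जून", "2018", "में", "पाकिस्तान", "को",
    "ग्रे", "लिस्ट", "में", "रखा", "था", "।",
]
spans = [
    (["पेरिस"], "GPE"),
    (["Financial", "Action", "Task", "Force", ",", "FATF"], "ORGANISATION"),
    (["2018"], "DATE"),
    (["पाकिस्तान"], "GPE"),
]

bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Matching by first free occurrence is a simplification: a sentence that repeats an entity string would need character offsets or token indices to disambiguate.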

5. Machine Translation (MT): Parallel translations of each code-mixed sentence into English, Romanized Hindi, and Devanagari Hindi.

  • Example sentence: भारत में भी green growth, climate resilient infrastructure और ग्रीन transition पर विशेष रूप से बल दिया जा रहा है।
  • English: In India too, special emphasis is being given to green growth, climate resilient infrastructure, and green transition.
  • Romanized Hindi: Bharat mein bhi green growth, climate resilient infrastructure aur green transition par vishesh roop se bal diya ja raha hai.
  • Devanagari Hindi: भारत में भी हरित विकास, जलवायु सहनशील आधारिक संरचना और हरित संक्रमण पर विशेष रूप से बल दिया जा रहा है।

Citation

    @inproceedings{sheth-etal-2025-comi,
        title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
        author = "Sheth, Rajvee  and
          Beniwal, Himanshu  and
          Singh, Mayank",
        editor = "Christodoulopoulos, Christos  and
          Chakraborty, Tanmoy  and
          Rose, Carolyn  and
          Peng, Violet",
        booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
        month = nov,
        year = "2025",
        address = "Suzhou, China",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2025.findings-emnlp.422/",
        pages = "7973--7992",
        ISBN = "979-8-89176-335-7",
    }
    

Purpose of Dataset

This dataset is curated to fill the gap in high-quality Hinglish resources by offering expert-reviewed annotations for key NLP tasks.

Activity Overview

  • Downloads 10
  • Views 397
  • File Size 51.71 MB

Tags

  • Code-Mixing
  • Data-annotation
  • Hinglish
  • expert-annotated

License Control

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Version Control

Version 2 (51.71 MB)
  • admin · 7 month(s) ago
    • text/csv
      POS_train.csv
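
The column layout of the CSV files is not described on this page, so a reasonable first step after downloading is to inspect the header and a few rows. The following is a minimal sketch using only the Python standard library; `peek_csv` is a hypothetical helper, and the local path and the UTF-8 encoding are assumptions.

```python
import csv
from itertools import islice
from pathlib import Path


def peek_csv(lines, max_rows=3):
    """Return (column_names, first_rows) from an open CSV file or any iterable of lines."""
    reader = csv.DictReader(lines)
    rows = list(islice(reader, max_rows))
    return reader.fieldnames, rows


# File name taken from the listing above; its location on disk is an assumption.
path = Path("POS_train.csv")
if path.exists():
    # "utf-8-sig" reads plain UTF-8 and also tolerates a leading byte-order mark.
    with path.open(encoding="utf-8-sig", newline="") as f:
        columns, sample = peek_csv(f)
    print("columns:", columns)
    for row in sample:
        print(row)
```

Reading only the first few rows keeps the check fast and avoids committing to column names before they are known.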

Related Models

COMI-LINGUA-NER
This is a fine-tuned version of aya-expanse-8b for Named Entity Recognition (NER) on Hinglish (Hindi-English code-mixed) text. It performs token-level entity tagging (PERSON, ORGANISATION, LOCATION, DATE, TIME, GPE, HASHTAG, EMOJI, MENTION, X/Other) in Roman and Devanagari scripts. It achieves 94.90 F1 on the COMI-LINGUA test set (5K instances), compared with 59.88 F1 for zero-shot inference.
Code-Mixing
Hinglish
  • Upvoters 0
  • Downloads 0
  • File Size 979.66 MB
  • Views 37
Updated 11 day(s) ago

IITGN

COMI-LINGUA-LID
This is a fine-tuned version of aya-expanse-8b for Token-level Language Identification (LID) on Hinglish (Hindi-English code-mixed) text. It performs token-wise classification into three categories: en (English), hi (Hindi), or ot (Other).
Code-Mixing
Hinglish
  • Upvoters 0
  • Downloads 0
  • File Size 1.89 GB
  • Views 40
Updated 11 day(s) ago

IITGN

COMI-LINGUA-MLI
This is a fine-tuned version of aya-expanse-8b for Matrix Language Identification (MLI) on Hinglish (Hindi-English code-mixed) text. It classifies each sentence by the dominant matrix language governing its grammatical structure: hi (Hindi) or en (English).
Code-Mixing
Hinglish
  • Upvoters 0
  • Downloads 0
  • File Size 1.89 GB
  • Views 44
Updated 11 day(s) ago

IITGN

COMI-LINGUA-MT
This is a fine-tuned version of Llama-3.1-8B-Instruct for Machine Translation (MT) on Hinglish (Hindi-English code-mixed) text. It translates code-mixed input in Roman/Devanagari scripts to three target formats: (i) Standard English, (ii) Romanized Hindi, and (iii) Devanagari Hindi.
Hinglish
Code-Mixing
  • Upvoters 0
  • Downloads 0
  • File Size 1.89 GB
  • Views 46
Updated 11 day(s) ago

IITGN

COMI-LINGUA-POS
This is a fine-tuned version of aya-expanse-8b for Part-of-Speech (POS) Tagging on Hinglish (Hindi-English code-mixed) text. It assigns a grammatical category to each token using a language-agnostic Universal POS tagset suitable for code-mixed content in Roman and Devanagari scripts.
Hinglish
  • Upvoters 0
  • Downloads 0
  • File Size 979.67 MB
  • Views 43
Updated 12 day(s) ago

IITGN