COMI-LINGUA

COde-MIxing and LINGuistic Insights on Natural Hinglish Usage and Annotation

About Dataset

COMI-LINGUA (COde-MIxing and LINGuistic Insights on Natural Hinglish Usage and Annotation) is a high-quality Hindi-English code-mixed dataset, manually annotated by three expert annotators. It serves as a multitask benchmark for multilingual NLP models.

COMI-LINGUA provides annotations for the following key NLP tasks:

1. Language Identification (LID): Token-wise classification of Hindi, English, and other units.

Initial predictions generated using the Microsoft LID tool; annotators reviewed and corrected.

Example sentence: प्रधानमंत्री नरेन्द्र मोदी डिजिटल इंडिया मिशन को आगे बढ़ाने के लिए पिछले सप्ताह Google के CEO सुंदर पिचाई से मुलाकात की थी ।
LID tags: hi hi hi en en en hi hi hi hi hi hi hi en hi en hi hi hi hi hi ot

2. Matrix Language Identification (MLI): Sentence-level dominant language annotation.

Example sentence: किसानों को अपनी फसल बेचने में दिक्कत न हो इसके लिये Electronic National Agriculture Market यानि ई-नाम योजना तेजी से काम हो रहा है।
Matrix Language: hi

3. Part-of-Speech (POS) Tagging: Syntactic categorization of tokens.

Tags pre-assigned using the CodeSwitch NLP library; annotators reviewed and corrected.

Example sentence: भारत द्वारा बनाया गया Unified Payments Interface यानि UPI भारत की एक बहुत बड़ी success story है ।
POS tags: PROPN ADP VERB VERB PROPN PROPN PROPN CONJ PROPN PROPN ADP DET ADJ ADJ NOUN NOUN VERB X
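
Token-level annotations such as the POS tags above are meant to line up one-to-one with the whitespace-separated tokens of the sentence. The following is a minimal Python sketch of that alignment check, assuming plain whitespace tokenization; the helper name `align_tags` is hypothetical and not part of any dataset tooling.

```python
def align_tags(sentence: str, tags: str) -> list[tuple[str, str]]:
    """Pair each whitespace-separated token with its tag."""
    tokens = sentence.split()
    labels = tags.split()
    # Fail loudly instead of silently truncating when the lengths differ.
    if len(tokens) != len(labels):
        raise ValueError(f"{len(tokens)} tokens but {len(labels)} tags")
    return list(zip(tokens, labels))


sentence = "भारत द्वारा बनाया गया Unified Payments Interface यानि UPI भारत की एक बहुत बड़ी success story है ।"
pos_tags = "PROPN ADP VERB VERB PROPN PROPN PROPN CONJ PROPN PROPN ADP DET ADJ ADJ NOUN NOUN VERB X"

pairs = align_tags(sentence, pos_tags)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

Raising on a length mismatch, rather than relying on `zip` to truncate, makes misaligned rows visible before they reach a model.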

4. Named Entity Recognition (NER): Identification of named entities in Hinglish text.

Entities pre-tagged using the CodeSwitch NLP library; annotators reviewed and corrected.

Example sentence: मालूम हो कि पेरिस स्थित Financial Action Task Force, FATF ने जून 2018 में पाकिस्तान को ग्रे लिस्ट में रखा था।
NER tags: "पेरिस": GPE, "Financial Action Task Force, FATF": ORGANISATION, "2018": DATE, "पाकिस्तान": GPE
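
The NER example lists entities as quoted span and label pairs, and one span ("Financial Action Task Force, FATF") itself contains a comma, so naive comma splitting would break it. Below is a small Python sketch that reads the pairs with a regular expression. It assumes the display format shown above; the released files may store entities differently.

```python
import re

ner_line = '"पेरिस": GPE, "Financial Action Task Force, FATF": ORGANISATION, "2018": DATE, "पाकिस्तान": GPE'

# Match a double-quoted span followed by a colon and an upper-case label.
PAIR = re.compile(r'"([^"]+)":\s*([A-Z]+)')

entities = PAIR.findall(ner_line)
for span, label in entities:
    print(f"{label}\t{span}")
```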

5. Machine Translation (MT): Parallel translation in three variants: English, Romanized Hindi, Devanagari Hindi. Initial translation predictions were generated using the Llama 3.3 LLM, which annotators then refined and corrected.

Example Sentence: भारत में भी green growth, climate resilient infrastructure और ग्रीन transition पर विशेष रूप से बल दिया जा रहा है।
English: In India too, special emphasis is being given to green growth, climate resilient infrastructure, and green transition.
Romanized Hindi: Bharat mein bhi green growth, climate resilient infrastructure aur green transition par vishesh roop se bal diya ja raha hai.
Devanagari Hindi: भारत में भी हरित विकास, जलवायु सहनशील आधारिक संरचना और हरित संक्रमण पर विशेष रूप से बल दिया जा रहा है।

6. Text Normalization (TN): Sentence-level normalization of noisy, informal, code-mixed Hinglish to a standardized form. Common spelling variants (e.g. hain/hai/hayn/hay/he → hain) are mapped consistently. Initial suggestions came from GPT-OSS-120B; annotators reviewed and corrected.

Example Sentence: Janmdin ki dheron shubhkaamnaayen sir... Aap swasth rahen... Dirghayu ho... Yahi hamare poore pariwaar ki shubhkaamnaayen hain
Normalized: Janmadin ki dheron shubhkaamnayein sir. Aap swasth rahe. Dirghayu ho, yahi hamare pure parivaar ki shubhkaamnayein hain.
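
The variant mapping described for this task can be pictured as a lookup table. The Python sketch below is an illustration only: the table holds just the one variant group named in the text, the function name `normalize_variants` is hypothetical, and the dataset's normalizations were reviewed by annotators at the sentence level rather than produced by a lookup like this.

```python
import re

# Only the variant group named in the text; a real table would be far larger.
VARIANTS = {"hai": "hain", "hayn": "hain", "hay": "hain", "he": "hain", "hain": "hain"}


def normalize_variants(text: str) -> str:
    """Replace known spelling variants word by word, leaving other words untouched."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return VARIANTS.get(word.lower(), word)

    # Only runs of ASCII letters are candidates for replacement.
    return re.sub(r"[A-Za-z]+", swap, text)


print(normalize_variants("ye sab log yahan hayn"))
```

A context-free lookup like this will also rewrite the English word "he", which is one reason human review of the suggestions matters for code-mixed text.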

Citation:

@inproceedings{sheth-etal-2025-comi,
    title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
    author = "Sheth, Rajvee  and
      Beniwal, Himanshu  and
      Singh, Mayank",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.422/",
    pages = "7973--7992",
    ISBN = "979-8-89176-335-7",
}

Purpose of Dataset

This dataset is curated to fill the gap in high-quality Hinglish resources by offering expert-reviewed annotations for key NLP tasks.

Dataset Metadata

License

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Geographical coverage

Global

Sector

Science, Technology and Research

Author

Rajvee Sheth, Mayank Singh

Source Organisation

IITGN

Uploaded by

Lingo Research Group

Data Quality Score (Beta)

4.5

Dataset type

Structured

Frequency

Daily

Time Granularity

Five Yearly

Year range

N.A.

Date & Time

16/07/25 11:25:41

Visibility

Open

Hosted / Redirected

Hosted

Data Type

Primary

Version Control

Version 2 (51.71 MB)

admin·11 month(s) ago
- POS_train.csv

Version 1 (350.02 MB)

admin·11 month(s) ago

No File(s) Found!
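
The column layout of POS_train.csv is not described on this page, so a cautious first step after downloading it is to read only the header and a few rows before assuming any schema. Here is a sketch using Python's standard `csv` module; the function name `peek_csv` and the stand-in data with made-up column names are hypothetical.

```python
import csv
import io
from itertools import islice


def peek_csv(handle, n_rows: int = 3):
    """Return the header and the first n_rows data rows of an open CSV file."""
    reader = csv.reader(handle)
    header = next(reader, [])
    rows = list(islice(reader, n_rows))
    return header, rows


# Stand-in data with made-up column names. For the real file, use:
#   with open("POS_train.csv", newline="", encoding="utf-8") as f:
#       header, rows = peek_csv(f)
sample = io.StringIO("col_a,col_b\nx,1\ny,2\n")
header, rows = peek_csv(sample)
print(header)
print(rows)
```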

Related Models

COMI-LINGUA-POS

This is a fine-tuned version of aya-expanse-8b for Part-of-Speech (POS) Tagging on Hinglish (Hindi-English code-mixed) text. It assigns a grammatical category to each token using a language-agnostic Universal POS tagset suitable for code-mixed content in Roman and Devanagari scripts.

Tags: Hinglish
Size: 979.67 MB
Updated 3 month(s) ago
Source: IITGN

COMI-LINGUA-MT

This is a fine-tuned version of Llama-3.1-8B-Instruct for Machine Translation (MT) on Hinglish (Hindi-English code-mixed) text. It translates code-mixed input in Roman/Devanagari scripts to three target formats: (i) Standard English, (ii) Romanized Hindi, and (iii) Devanagari Hindi.

Tags: Hinglish, Code-Mixing
Size: 1.89 GB
Updated 3 month(s) ago
Source: IITGN

COMI-LINGUA-MLI

This is a fine-tuned version of aya-expanse-8b for Matrix Language Identification (MLI) on Hinglish (Hindi-English code-mixed) text. It classifies each sentence by the dominant matrix language governing its grammatical structure: hi (Hindi) or en (English).

Tags: Code-Mixing, Hinglish
Size: 1.89 GB
Updated 3 month(s) ago
Source: IITGN

COMI-LINGUA-LID

This is a fine-tuned version of aya-expanse-8b for Token-level Language Identification (LID) on Hinglish (Hindi-English code-mixed) text. It performs token-wise classification into three categories: en (English), hi (Hindi), or ot (Other).

Tags: Code-Mixing, Hinglish
Size: 1.89 GB
Updated 3 month(s) ago
Source: IITGN

COMI-LINGUA-NER

This is a fine-tuned version of aya-expanse-8b for Named Entity Recognition (NER) on Hinglish (Hindi-English code-mixed) text. It performs token-level entity tagging (PERSON, ORGANISATION, LOCATION, DATE, TIME, GPE, HASHTAG, EMOJI, MENTION, X/Other) on text in Roman and Devanagari scripts. It achieves 94.90 F1 on the COMI-LINGUA test set (5K instances), compared with 59.88 F1 for zero-shot inference.

Tags: Hinglish, Code-Mixing
Size: 979.66 MB
Updated 3 month(s) ago
Source: IITGN
