
COde-MIxing and LINGuistic Insights on Natural Hinglish Usage and Annotation
COMI-LINGUA (COde-MIxing and LINGuistic Insights on Natural Hinglish Usage and Annotation) is a high-quality Hindi-English code-mixed dataset, manually annotated by three expert annotators. It serves as a multitask benchmark for multilingual NLP models.
COMI-LINGUA provides annotations for the following key NLP tasks:
1. Language Identification (LID): Token-wise classification of Hindi, English, and other units.
Initial predictions generated using the Microsoft LID tool; annotators reviewed and corrected.
2. Matrix Language Identification (MLI): Sentence-level dominant language annotation.
3. Part-of-Speech (POS) Tagging: Syntactic categorization of tokens.
Tags pre-assigned using the CodeSwitch NLP library; annotators reviewed and corrected.
4. Named Entity Recognition (NER): Identification of named entities in Hinglish text.
Tokens pre-tagged for language using the CodeSwitch NLP library; annotators reviewed and corrected.
5. Machine Translation (MT): Parallel translation in three variants: English, Romanized Hindi, Devanagari Hindi. Initial translation predictions were generated using the Llama 3.3 LLM; annotators then refined and corrected them.
6. Text Normalization (TN): Sentence-level normalization of noisy, informal, code-mixed Hinglish to a standardized form. Common variants (e.g. hain/hai/hayn/hay/he → hain) are mapped consistently. Initial suggestions were generated using GPT-OSS-120B; annotators reviewed and corrected them.
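The variant mapping described under Text Normalization can be illustrated with a minimal Python sketch. This is not the dataset's annotation pipeline: only the hain/hai/hayn/hay/he → hain mapping comes from the description above, while the function name `normalize_tokens`, the variant table layout, and the language tag strings "hi" and "en" are assumptions made for illustration.

```python
# Illustrative only: the single mapping below is taken from the dataset
# description; everything else (names, tag values) is hypothetical.
VARIANTS = {
    "hain": ["hain", "hai", "hayn", "hay", "he"],
}

# Invert the table so each spelling variant points to its canonical form.
CANONICAL = {
    variant: canonical
    for canonical, variants in VARIANTS.items()
    for variant in variants
}


def normalize_tokens(tokens, lang_tags, hindi_tag="hi"):
    """Map known spelling variants to a canonical form.

    Only tokens whose language tag equals `hindi_tag` are rewritten, so that
    an English word such as "he" is left untouched. The tag values are an
    assumption, not the dataset's actual label set.
    """
    if len(tokens) != len(lang_tags):
        raise ValueError("tokens and lang_tags must have the same length")
    return [
        CANONICAL.get(token.lower(), token) if tag == hindi_tag else token
        for token, tag in zip(tokens, lang_tags)
    ]


if __name__ == "__main__":
    tokens = ["he", "said", "woh", "log", "ghar", "par", "hay"]
    tags = ["en", "en", "hi", "hi", "hi", "hi", "hi"]
    print(normalize_tokens(tokens, tags))
```

Restricting the rewrite to Hindi-tagged tokens shows why token-level language labels matter for normalization in code-mixed text: a plain lookup would also rewrite the English pronoun "he".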
Citation:
@inproceedings{sheth-etal-2025-comi,
title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
author = "Sheth, Rajvee and
Beniwal, Himanshu and
Singh, Mayank",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.422/",
pages = "7973--7992",
ISBN = "979-8-89176-335-7",
}
This dataset is curated to fill the gap in high-quality Hinglish resources by offering expert-reviewed annotations for key NLP tasks.
License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)