
COde-MIxing and LINGuistic Insights on Natural Hinglish Usage and Annotation
COMI-LINGUA (COde-MIxing and LINGuistic Insights on Natural Hinglish Usage and Annotation) is a high-quality Hindi-English code-mixed dataset, manually annotated by three expert annotators. It serves as a multitask benchmark for multilingual NLP models.
COMI-LINGUA provides annotations for five NLP tasks:
1. Language Identification (LID): Token-wise classification of Hindi, English, and other linguistic units.
Initial predictions were generated with the Microsoft LID tool and then reviewed and corrected by annotators.
2. Matrix Language Identification (MLI): Sentence-level annotation of the dominant language.
3. Part-of-Speech (POS) Tagging: Syntactic categorization for linguistic analysis.
Tags were pre-assigned with the CodeSwitch NLP library and then reviewed and corrected by annotators.
4. Named Entity Recognition (NER): Identification of named entities in Hinglish text.
Each token was pre-assigned a tag with the CodeSwitch NLP library; annotators then reviewed and corrected these tags.
5. Machine Translation (MT): Parallel translations of sentences in Romanized Hindi, Devanagari Hindi, and English.
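To make the difference between token-level LID and sentence-level MLI concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `TaggedSentence` class, the label names `hi`, `en`, and `ot`, and the example sentence are hypothetical and do not reflect the dataset's actual schema or label set. The majority-vote function is only a naive baseline for guessing the dominant language; in COMI-LINGUA itself the matrix language is annotated by experts, not derived this way.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class TaggedSentence:
    """Hypothetical container for one sentence with token-level language tags."""
    tokens: List[str]
    lid_tags: List[str]  # assumed labels: "hi" (Hindi), "en" (English), "ot" (other)

    def __post_init__(self) -> None:
        # Token-level annotation requires exactly one tag per token.
        if len(self.tokens) != len(self.lid_tags):
            raise ValueError("tokens and lid_tags must have the same length")


def majority_language(sentence: TaggedSentence,
                      ignore: Sequence[str] = ("ot",)) -> Optional[str]:
    """Naive matrix-language guess: the most frequent tag, skipping ignored labels.

    Returns None when no countable tag remains. If two tags are equally
    frequent, the one that appears first in the sentence is returned.
    """
    counts = Counter(tag for tag in sentence.lid_tags if tag not in ignore)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up Hinglish sentence, not taken from the dataset.
example = TaggedSentence(
    tokens=["mujhe", "yeh", "movie", "bahut", "pasand", "hai", "!"],
    lid_tags=["hi", "hi", "en", "hi", "hi", "hi", "ot"],
)
print(majority_language(example))  # prints: hi
```

A count-based guess like this ignores grammatical structure, which is one reason a manually annotated matrix-language label is useful as a reference.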
Citation:
@inproceedings{sheth-etal-2025-comi,
title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
author = "Sheth, Rajvee and
Beniwal, Himanshu and
Singh, Mayank",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.422/",
pages = "7973--7992",
ISBN = "979-8-89176-335-7",
}
This dataset is curated to fill the gap in high-quality Hinglish resources by offering expert-reviewed annotations for key NLP tasks.
Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)