This is a LoRA-adapted fine-tune of aya-expanse-8b for token-level Language Identification (LID) on Hinglish (Hindi-English code-mixed) text. It classifies each token into one of three categories: en (English), hi (Hindi), or ot (Other).
en - English
hi - Hindi
ot - Other (punctuation, symbols, hashtags, emojis, URLs, mixed/unknown script elements)

Base model: CohereForAI/aya-expanse-8b

Achieves 94.90 F1 on the COMI-LINGUA LID test set (5K instances), outperforming zero-shot inference from strong closed-source LLMs (e.g., gpt-4o ≈ 92.7 F1) and traditional tools (e.g., Microsoft LID ≈ 74.4 F1), establishing strong performance among open-weight models for fine-grained Hinglish language tagging.
| Setting | Precision | Recall | F1-score |
|---|---|---|---|
| Zero-shot | 51.08 | 70.55 | 59.05 |
| One-shot | 73.03 | 71.07 | 70.48 |
| Fine-tuned | 87.45 | 86.92 | 94.90 |
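
For readers who want to score their own predictions with the same three labels, the following is a minimal sketch of token-level precision, recall, and F1. The card does not state which averaging scheme was used for the table, so macro averaging over en, hi, and ot is an assumption here, and the function name `macro_prf` is hypothetical.

```python
from collections import Counter

LABELS = ("en", "hi", "ot")


def macro_prf(gold, pred, labels=LABELS):
    """Return macro-averaged (precision, recall, f1) for token labels.

    Assumption: macro averaging; the model card does not say which
    averaging scheme produced its reported scores.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example: one Hindi token is mislabeled as English.
gold = ["en", "hi", "hi", "ot"]
pred = ["en", "hi", "en", "ot"]
precision, recall, f1 = macro_prf(gold, pred)
```

Scores computed this way are per-token, so a sentence with many punctuation tokens contributes many `ot` decisions.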
Identify the language of each token (en = English, hi = Hindi, ot = Other) in the sentence:
New Delhi/Alive News : आज के दिन में इंडिया की टीम ने ऑस्ट्रेलिया को हराया। #INDvAUS
Output:
[
{"New": "en"},
{"Delhi": "en"},
{"/": "ot"},
{"Alive": "en"},
{"News": "en"},
{":": "ot"},
{"आज": "hi"},
{"के": "hi"},
{"दिन": "hi"},
{"में": "hi"},
{"इंडिया": "en"},
{"की": "hi"},
{"टीम": "en"},
{"ने": "hi"},
{"ऑस्ट्रेलिया": "en"},
{"को": "hi"},
{"हराया": "hi"},
{"।": "ot"},
{"#INDvAUS": "ot"}
]
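
The output format above (a JSON list of single-key objects) can be turned into (token, label) pairs with a few lines of Python. This is a minimal sketch assuming the model returns exactly that JSON and nothing else; the helper name `parse_lid_output` is hypothetical and not part of the released model.

```python
import json

VALID_LABELS = {"en", "hi", "ot"}


def parse_lid_output(raw):
    """Parse the JSON list of {token: label} objects into (token, label) pairs.

    Assumption: `raw` contains only the JSON array shown in the example,
    with no surrounding text.
    """
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of {token: label} objects")

    pairs = []
    for item in items:
        if not isinstance(item, dict) or len(item) != 1:
            raise ValueError(f"expected a single-key object, got: {item!r}")
        (token, label), = item.items()
        if label not in VALID_LABELS:
            raise ValueError(f"unknown label {label!r} for token {token!r}")
        pairs.append((token, label))
    return pairs


raw = '[{"New": "en"}, {"/": "ot"}, {"आज": "hi"}, {"इंडिया": "en"}]'
pairs = parse_lid_output(raw)
```

Keeping the result as an ordered list of pairs, rather than merging everything into one dictionary, preserves token order and repeated tokens within a sentence.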
@inproceedings{sheth-etal-2025-comi,
title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
author = "Sheth, Rajvee and
Beniwal, Himanshu and
Singh, Mayank",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.422/",
pages = "7973--7992",
isbn = "979-8-89176-335-7"
}
License: Apache 2.0
Rajvee Sheth, Mayank Singh
Transformers
10/02/26 11:31:01
1.89 GB