Indian Flag
Government Of India
A-
A
A+

COMI-LINGUA-MLI

This is a fine-tuned version of aya-expanse-8b for Part-of-Speech (POS) Tagging on Hinglish (Hindi-English code-mixed) text. It classifies each sentence at the sentence level into the dominant matrix language governing the grammatical structure: hi (Hindi) or en (English).

About Model

Hindi-English MLI model

A LoRA-adapted Transformer LLM fine-tuned for sentence-level Matrix Language Identification (MLI) on Hindi–English (Hinglish) code-mixed text.


Supported Labels
  • hi - Hindi matrix (dominant grammatical structure follows Hindi)
  • en - English matrix (dominant grammatical structure follows English)

Model Overview
  • Model type: LoRA-adapted Transformer LLM
  • Base model: CohereForAI/aya-expanse-8b
  • Total parameters: 8B
  • Trainable parameters: ~32M
  • License: Apache 2.0
  • Languages: Hindi, English

Performance

Achieves 98.77 F1 (with 98.90 Precision and 98.77 Recall) on the COMI-LINGUA MLI test set (5K instances), setting state-of-the-art (SOTA) among open-weight models, outperforming zero-shot closed LLMs (e.g., gpt-4o ≈ 98.0 F1) and traditional tools.

Setting Precision Recall F1-score
Zero-shot 98.71 59.56 74.25
One-shot 98.35 81.36 89.00
Fine-tuned 98.90 98.77 94.94

Example Inference

Identify the matrix language (hi = Hindi matrix, en = English matrix) in the sentence:

PM Narendra Modi ne Google CEO Sundar Pichai se mulakat ki.

Output: 'hi'


Citation
@inproceedings{sheth-etal-2025-comi,
  title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
  author = "Sheth, Rajvee and
               Beniwal, Himanshu and
               Singh, Mayank",
  editor = "Christodoulopoulos, Christos and
               Chakraborty, Tanmoy and
               Rose, Carolyn and
               Peng, Violet",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
  month = nov,
  year = "2025",
  address = "Suzhou, China",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.findings-emnlp.422/",
  pages = "7973--7992",
  isbn = "979-8-89176-335-7"
}

COMI-LINGUA-MLI

Metadata Metadata

Apache 2.0

Rajvee Sheth, Mayank Singh

Transformers

Transformers

Open

IITGN

Science, Technology and Research

10/02/26 05:48:06

1.89 GB

Activity Overview Activity Overview

  • Downloads0
  • Downloads 0
  • Views 69
  • File Size 1.89 GB

Tags Tags

  • Code-Mixing
  • Hinglish

License Control License Control

Apache 2.0

Version Control Version Control

FolderVersion 1(1.89 GB)
  • admin·1 month(s) ago
    • undefined
      .DS_Store
    • application/json
      adapter_config.json
    • undefined
      adapter_model.safetensors
    • undefined
      chat_template.jinja
    • undefined
      optimizer.pt
    • text/markdown
      README.md
    • undefined
      rng_state.pth
    • undefined
      scaler.pt
    • undefined
      scheduler.pt
    • more_horiz 4 more

More Models from IITGN More Models from IITGN

COMI-LINGUA-POS
This is a fine-tuned version of aya-expanse-8b for Part-of-Speech (POS) Tagging on Hinglish (Hindi-English code-mixed) text. It assigns a grammatical category to each token using a language-agnostic Universal POS tagset suitable for code-mixed content in Roman and Devanagari scripts.
Hinglish
  • See Upvoters0
  • Downloads0
  • File Size979.67 MB
  • Views69
Updated 30 day(s) ago

IITGN

COMI-LINGUA-MT
This is a fine-tuned version of Llama-3.1-8B-Instruct for Machine Translation (MT) on Hinglish (Hindi-English code-mixed) text. It translates code-mixed input in Roman/Devanagari scripts to three target formats: (i) Standard English, (ii) Romanized Hindi, and (iii) Devanagari Hindi.
Code-Mixing
Hinglish
  • See Upvoters0
  • Downloads0
  • File Size1.89 GB
  • Views82
Updated 30 day(s) ago

IITGN

COMI-LINGUA-MLI
This is a fine-tuned version of aya-expanse-8b for Part-of-Speech (POS) Tagging on Hinglish (Hindi-English code-mixed) text. It classifies each sentence at the sentence level into the dominant matrix language governing the grammatical structure: hi (Hindi) or en (English).
Hinglish
Code-Mixing
  • See Upvoters0
  • Downloads0
  • File Size1.89 GB
  • Views70
Updated 30 day(s) ago

IITGN

COMI-LINGUA-LID
This is a fine-tuned version of aya-expanse-8b for Token-level Language Identification (LID) on Hinglish (Hindi-English code-mixed) text. It performs token-wise classification into three categories: en (English), hi (Hindi), or ot (Other).
Code-Mixing
Hinglish
  • See Upvoters0
  • Downloads0
  • File Size1.89 GB
  • Views72
Updated 30 day(s) ago

IITGN

COMI-LINGUA-NER
This is a fine-tuned version of aya-expanse-8b for Named Entity Recognition (NER) on Hinglish (Hindi-English code-mixed) text. It helps with token-level entity tagging (PERSON, ORGANISATION, LOCATION, DATE, TIME, GPE, HASHTAG, EMOJI, MENTION, X/Other) in Roman/Devanagari scripts. Achieves 94.90 F1 on COMI-LINGUA test set (5K instances), outperforming the zero-shot inference (59.88 F1).
Code-Mixing
Hinglish
  • See Upvoters0
  • Downloads0
  • File Size979.66 MB
  • Views71
Updated 30 day(s) ago

IITGN

Ganga-2-1B
The first pre-trained Hindi model by any academic research lab in India 🇮🇳!
Text Generation
  • See Upvoters1
  • Downloads37
  • File Size1.88 GB
  • Views393
Updated 8 month(s) ago

IITGN