Northeast Language Identification

NE-LID is a fast and accurate language identification model for Northeast Indian languages that uses character-level features. It is designed for low-resource, script-diverse text and achieves high accuracy on short sentences.

About Model

NE-LID is a sentence-level language identification model developed for low-resource languages of Northeast India. The model supports ten languages: Assamese, Bodo, Garo, Hindi, Khasi, Kokborok, Meitei, Mizo, Naga, and Nyishi. It is trained as a fastText supervised classifier with character n-gram features, which makes it robust to spelling variation, short text, and multiple writing scripts. A balanced dataset of two thousand sentences per language was used, with a stratified train/dev/test split.

Extensive evaluation shows that the model achieves around 99% accuracy and outperforms transformer-based language models on this task. Experiments indicate that character-level approaches are more effective than subword-based transformers for language identification in low-resource, script-diverse settings. The model is suitable for language routing, data filtering, and preprocessing for machine translation, speech recognition, and other downstream language technology applications in the Northeast India context.
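The data preparation and features described above can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the language codes (`asm`, `kha`, `lus`), the 80/10/10 split fractions, the random seed, and the n-gram lengths are all assumptions, since the model card does not state them. The `__label__` prefix is fastText's default label marker for supervised training files, and the `<` and `>` boundary marks mirror how fastText builds character n-grams for a word.

```python
import random
from collections import defaultdict


def char_ngrams(word, minn=2, maxn=4):
    """Character n-grams of a word, with fastText-style '<' and '>' boundary marks.

    minn and maxn are illustrative values, not the model's actual settings.
    """
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(marked) - n + 1)
    ]


def to_fasttext_line(label, sentence):
    """Format one example as a fastText supervised training line."""
    # Collapse runs of whitespace so each example stays on a single clean line.
    return f"__label__{label} {' '.join(sentence.split())}"


def stratified_split(examples, dev_frac=0.1, test_frac=0.1, seed=13):
    """Split (label, sentence) pairs so every language keeps the same proportions."""
    by_label = defaultdict(list)
    for label, sentence in examples:
        by_label[label].append(sentence)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    for label in sorted(by_label):
        sentences = by_label[label][:]
        rng.shuffle(sentences)
        n_dev = int(len(sentences) * dev_frac)
        n_test = int(len(sentences) * test_frac)
        dev += [(label, s) for s in sentences[:n_dev]]
        test += [(label, s) for s in sentences[n_dev:n_dev + n_test]]
        train += [(label, s) for s in sentences[n_dev + n_test:]]
    return train, dev, test


# Toy corpus with placeholder sentences: 2,000 per language, as on the model card.
# The three language codes are hypothetical labels chosen for this sketch.
toy = [(lang, f"sentence {i}") for lang in ("asm", "kha", "lus") for i in range(2000)]
train, dev, test = stratified_split(toy)
print(len(train), len(dev), len(test))

print(to_fasttext_line("kha", "  Khublei   shibun "))
print(char_ngrams("ka", minn=2, maxn=3))
```

Because the split is done per language, a balanced corpus stays balanced in every partition, which keeps per-language accuracy comparable across dev and test. Lines produced by `to_fasttext_line` could then be written to a text file and passed to a fastText supervised trainer; the hyperparameters used for NE-LID are not given here.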

Metadata

License: Attribution 4.0 International (CC BY 4.0)
Hosted By: MWirelabs
Task Type: Classification Model
Model Format: Other
Visibility: Open
Source Organisation: MWire Labs
Sector: Social
Updated Date & Time: 14/01/26 15:04:29
Created By: Badal Nyalang

Related Models

More Models from MWire Labs

Northeast STT Multilingual Speech to Text Model

A multilingual Speech-to-Text (STT) model for eight Northeast Indian languages, fine-tuned from Whisper Medium using over 150,000 speech-text pairs from public and institutional datasets. The model expands speech recognition support for low-resource indigenous languages, including Khasi, Garo, Mizo, Kokborok, Nagamese, Assamese, Chakma, and Wancho.

Tags: Northeast India Languages, northeast-india, low-resource, Automatic Speech Recognition, Multilingual, Speech to Text, Multilingual speech, Speech processing, whisper

Updated 3 day(s) ago

NE-SpeechEmbed

NE-SpeechEmbed is a multilingual speech-text embedding model by MWire Labs for Northeast Indian languages. The model supports semantic speech search, cross-modal retrieval, and audio-text embeddings across Khasi, Garo, Mizo, Nagamese, Kokborok, Assamese, Wancho, and Chakma.

Tags: speech-embeddings, speech-text-retrieval, audio-text-retrieval, multilingual, speech-language-model, mwire-labs, speech, XLM-RoBERTa, low-resource, retrieval, northeast-india, low-resource-languages, whisper, embeddings

Updated 6 day(s) ago

NE-Embed

NE-Embed is a multilingual text embedding model for Northeast Indian languages, enabling semantic search, retrieval, and RAG across 10 languages including Khasi, Garo, Meitei, Bodo, Mizo, Assamese, Nyishi, Kokborok, Pnar, and Nagamese. Fine-tuned on LaBSE with 201,738 parallel pairs.

Tags: retrieval, Multilingual, rag, embeddings, northeast-india, sentence-transformers, low-resource

Updated 6 day(s) ago

Mizo OCR - Text Recognition for Mizo Language

OCR model for the Mizo language achieving 90.68% character accuracy on synthetic and curated printed text.

Tags: Image-to-Text, OCR, low-resource, northeast-india, Mizo, trocr

Updated 3 month(s) ago

NE-OCR

NE-OCR is a multilingual Optical Character Recognition model developed by MWire Labs to accurately recognize printed text from documents in Northeast Indian languages. The model supports Assamese, Bodo, English, Garo, Hindi, Khasi, Kokborok, Meitei (Bengali script), Meitei (Meitei Mayek script), Mizo, Nagamese, and Nyishi. It is designed to enable reliable digitization of books, newspapers, government records, educational materials, and cultural archives from Northeast India where mainstream OCR

Tags: vitstr, Multilingual OCR, Northeast India OCR, Printed Text Recognition, OCR, Garo, Kokborok, Nyishi, Meitei, northeast-india, Nagamese, khasi, Mizo, Optical Character Recognition, BODO, doctr

Updated 3 month(s) ago

Nagamese Speech-to-Text

Automatic Speech Recognition (ASR) model for Nagamese speech, designed to transcribe spoken Nagamese into text for real-world usage.

Tags: Automatic Speech Recognition, low-resource-language, whisper, Nagamese, Speech Recognition, ASR

Updated 3 month(s) ago

Garo OCR - Text Recognition for Garo

OCR model for the Garo language achieving 93.13% character accuracy.

Tags: Garo, northeast-india, florence-2, Image-to-Text, OCR

Updated 3 month(s) ago

Northeast Language Identification

Tags: low-resource, Multilingual, language identification, fastText, MWire Labs, northeast-india

Updated 6 month(s) ago

NortheastNER

NortheastNER is a token classification model built on XLM-RoBERTa and fine-tuned on ~25k sentences from gazetteers, news, and cultural texts across Northeast India. It detects region-specific entities, places, tribes, festivals, tourist sites, flora, fauna, and experimental local names; ideal for low-resource NER, regional search, cultural analytics, and knowledge graph applications.

Tags: northeast-india, low-resource, XLM-RoBERTa, NER, Token Classification, Conservation, Northeast India, Meghalaya

Updated 7 month(s) ago

Kren-M

Northeast India's first AI language model. Kren-M is a 2.6B-parameter bilingual model for Khasi-English, built on Gemma-2-2B. It features the custom Kren-NE tokenizer, which covers 7 Northeast languages (Khasi, Garo, Mizo, Assamese, Manipuri, Nagamese, Nyishi) with a 35.7% efficiency gain, and it was trained on 5.43M Khasi sentences. Capabilities include bidirectional translation, natural conversation, and cultural context. It is designed for language preservation across Northeast India.

Tags: Garo, low-resource, Instruction-Tuning, Indian Languages, continued-pretraining, bilingual, Kren-M, Northeast India Languages, Foundational model, Tokenizer, Northeast India, khasi, northeast-india

Updated 7 month(s) ago
