ORGANISATION

Multilingual Understanding for Northeast India

NE-BERT is a multilingual ModernBERT model built to understand the diverse languages of Northeast India. It handles classification, named-entity recognition, cross-lingual retrieval, and robust understanding of noisy or code-mixed text. It is designed for practical use in governance, research, education, and local-language digital services.

About Use Case

NE-BERT is a foundational language understanding model created by MWire Labs to support the rich linguistic diversity of Northeast India. Built on the ModernBERT architecture, NE-BERT is trained on curated regional corpora spanning Khasi, Garo, Pnar, Assamese, Manipuri (Meitei), Mizo, Nyishi, Kokborok, and Nagamese, along with English.

The model addresses long-standing challenges in Northeast NLP: orthographic variation, dialect shifts, spelling inconsistencies, and highly code-mixed text. NE-BERT provides strong performance in classification, NER, retrieval, embeddings, and contextual sentence understanding. It is built to serve real-world needs in local governance, citizen services, rural institutions, educational platforms, and community-driven digital applications.

Developers can use NE-BERT to build multilingual search engines, automated classification systems, content moderation tools, civic data analyzers, and educational assistants. Its robustness to irregular text makes it suitable for field reports, scraped social media text, district-level documents, and unstructured datasets collected across the region. As part of a long-term effort to strengthen AI research in the Northeast, NE-BERT is openly released for public use through AIKosh and Hugging Face.
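To make the retrieval and embedding use concrete, here is a minimal sketch of how sentence vectors from an encoder such as NE-BERT could be pooled and ranked for search. It is not taken from the NE-BERT documentation: mean pooling and cosine similarity are assumed here as a common choice for encoder models, the function names (`mean_pool`, `cosine`, `rank`) are hypothetical, and the toy vectors stand in for real model outputs so the example runs without downloading anything.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        return total  # all padding: return the zero vector
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for encoder output: three token vectors, the last is padding.
query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
query_vec = mean_pool(query_tokens, query_mask)

# Hypothetical pre-computed document vectors.
docs = {"doc_a": [1.0, 1.0], "doc_b": [1.0, 0.0], "doc_c": [-1.0, -1.0]}
for doc_id, score in rank(query_vec, docs):
    print(doc_id, round(score, 3))
```

In a real pipeline the token vectors and attention mask would come from the model's tokenizer and final hidden states. Masking before averaging matters because padded positions would otherwise pull every sentence vector toward the same point.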

Source Organisation

MWire Labs

Sector

Social

Resources

External Resources:

Documentation

Related Datasets

Updated 7 month(s) ago

Garo-English Parallel Corpus

A curated set of ~2,500 Garo-English parallel sentence pairs released by MWire Labs to support low-resource translation and experimentation in Northeast Indian languages.

English

Parallel Corpus

low-resource

Garo

northeast-india

tibeto-burman

generated_from_other_dataset

A'chik

MWIRE LABS

Related Models

NortheastNER

NortheastNER is a token classification model built on XLM-RoBERTa and fine-tuned on ~25k sentences from gazetteers, news, and cultural texts across Northeast India. It detects region-specific entities such as places, tribes, festivals, tourist sites, flora, fauna, and experimental local names. It is well suited to low-resource NER, regional search, cultural analytics, and knowledge graph applications.

Token Classification

NER

northeast-india

low-resource

XLM-RoBERTa

Meghalaya

Conservation

Northeast India

Updated 7 month(s) ago

MWIRE LABS
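Because NortheastNER is described above as a token classification model, its raw output is one tag per token, and an application usually has to merge those tags into entity spans. The sketch below shows one way to do that. It assumes a BIO tagging scheme and uses hypothetical label names (`FESTIVAL`, `PLACE`); the model's actual label set is not stated on this page, and `decode_bio` is an illustrative helper, not part of any published API.

```python
def decode_bio(tokens, tags):
    """Merge per-token BIO tags into (entity_text, label) spans."""
    spans = []
    current_tokens, current_label = [], None

    def close_span():
        # Emit the span being built, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; finish the previous one first.
            close_span()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue an open span:
            # treated as outside any entity.
            close_span()
            current_tokens, current_label = [], None
    close_span()
    return spans


# Hypothetical tagged sentence; the labels are illustrative only.
tokens = ["Shad", "Suk", "Mynsiem", "is", "celebrated", "in", "Shillong"]
tags = ["B-FESTIVAL", "I-FESTIVAL", "I-FESTIVAL", "O", "O", "O", "B-PLACE"]
print(decode_bio(tokens, tags))
```

Dropping an `I-` tag that has no matching open span is a conservative design choice; some pipelines instead promote it to the start of a new entity.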

NE-BERT

NE-BERT is Northeast India's first domain-specific multilingual foundation model. Built on the ModernBERT architecture and trained on 8.3 million sentences, it supports 9 regional languages: Assamese, Khasi, Garo, Manipuri (Meitei), Mizo, Nyishi, Nagamese, Kokborok, and Pnar. It achieves state-of-the-art performance on regional benchmarks and offers 1.6x faster inference, helping to bridge the digital divide for low-resource languages.

modernbert

Masked Language Modeling

northeast-india

low-resource-NLP

northeast bert

mwirelabs

token-efficiency

Assamese

Garo

Nyishi

Meitei

Nagamese

khasi

A'chik

Mizo

kokborok

Pnar

Updated 7 month(s) ago

MWIRE LABS
