Multilingual Understanding for Northeast India

NE-BERT is a multilingual ModernBERT model built to understand the diverse languages of Northeast India. It handles classification, named-entity recognition, cross-lingual retrieval, and robust understanding of noisy or code-mixed text. It is designed for practical use in governance, research, education, and local-language digital services.

About Use Case

NE-BERT is a foundational language understanding model created by MWire Labs to support the rich linguistic diversity of Northeast India. Built on the ModernBERT architecture, NE-BERT is trained on curated regional corpora spanning Khasi, Garo, Pnar, Assamese, Manipuri (Meitei), Mizo, Nyishi, Kokborok, and Nagamese, along with English.

The model addresses long-standing challenges in Northeast NLP: orthographic variation, dialect shifts, spelling inconsistencies, and heavily code-mixed text. NE-BERT provides strong performance in classification, NER, retrieval, embeddings, and contextual sentence understanding. It is built to serve real-world needs in local governance, citizen services, rural institutions, educational platforms, and community-driven digital applications.

Developers can use NE-BERT to build multilingual search engines, automated classification systems, content moderation tools, civic data analyzers, and educational assistants. Its robustness to irregular text makes it suitable for field reports, scraped social media text, district-level documents, and unstructured datasets collected across the region.

As part of a long-term effort to strengthen AI research in the Northeast, NE-BERT is openly released for public use through AIKosh and Hugging Face.
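The retrieval and embedding use cases described above usually follow one pattern with encoder models: pool the per-token vectors into a single sentence vector, then rank candidate documents by cosine similarity to the query. The sketch below shows that pattern in plain Python under stated assumptions. The token vectors are toy numbers standing in for an encoder's output, since this page does not give the model identifier or loading code, and the function names (mean_pool, cosine, rank_documents) are illustrative, not part of NE-BERT or any library.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                total[i] += value
    if kept == 0:
        return total  # all padding: return the zero vector
    return [value / kept for value in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, document_vectors):
    """Return document indices ordered from most to least similar."""
    scored = [(cosine(query_vector, vec), idx)
              for idx, vec in enumerate(document_vectors)]
    scored.sort(key=lambda pair: -pair[0])
    return [idx for _, idx in scored]


# Toy token vectors (assumption: a real system would take these from the
# encoder's final hidden states, one vector per token).
documents = [
    mean_pool([[1.0, 0.0], [1.0, 0.0], [9.0, 9.0]], [1, 1, 0]),  # last token is padding
    mean_pool([[0.0, 1.0], [0.0, 3.0]], [1, 1]),
    mean_pool([[1.0, 1.0], [3.0, 1.0]], [1, 1]),
]
query = mean_pool([[2.0, 0.0], [4.0, 2.0]], [1, 1])

ranking = rank_documents(query, documents)
print(ranking)
```

Mask-aware pooling matters for batched input, where shorter sentences are padded and the padding vectors would otherwise distort the average. Whether mean pooling is the best choice for NE-BERT specifically is not stated in the source; it is a common default for BERT-style encoders.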

Source Organization

MWire Labs

Tags

  • Northeast India
  • Northeast India Languages
  • northeast bert
  • language-modelling
  • multilingual-nlp
  • low-resource-languages
  • text-classification

Sector

Social

Resources

External Resources:

Related Datasets

Updated 3 month(s) ago
Garo-English Parallel Corpus
A curated set of ~2,500 Garo-English parallel sentence pairs released by MWire Labs to support low-resource translation and experimentation in Northeast Indian languages.
English
Parallel Corpus
low-resource
Garo
northeast-india
tibeto-burman
generated_from_other_dataset
A'chik
  • Upvoters: 0
  • Downloads: 21
  • File Size: 0
  • Views: 126

MWIRE LABS

Related Models

NortheastNER
NortheastNER is a token classification model built on XLM-RoBERTa and fine-tuned on ~25k sentences from gazetteers, news, and cultural texts across Northeast India. It detects region-specific entities such as places, tribes, festivals, tourist sites, flora, fauna, and experimental local names. It is suited to low-resource NER, regional search, cultural analytics, and knowledge graph applications.
Token Classification
NER
northeast-india
low-resource
XLM-RoBERTa
Meghalaya
Conservation
Northeast India
  • Upvoters: 0
  • Downloads: 9
  • File Size: 0
  • Views: 117
Updated 3 month(s) ago

MWIRE LABS

NE-BERT
NE-BERT is Northeast India's first domain-specific multilingual foundation model. Built on the ModernBERT architecture and trained on 8.3 million sentences, it supports 9 regional languages: Assamese, Khasi, Garo, Manipuri (Meitei), Mizo, Nyishi, Nagamese, Kokborok, and Pnar. It achieves state-of-the-art performance on regional benchmarks and offers 1.6x faster inference, helping to bridge the digital divide for low-resource languages.
modernbert
Masked Language Modeling
northeast-india
low-resource-NLP
northeast bert
mwirelabs
token-efficiency
Assamese
Garo
Nyishi
Meitei
Nagamese
khasi
A'chik
Mizo
kokborok
Pnar
  • Upvoters: 0
  • Downloads: 10
  • File Size: 0
  • Views: 307
Updated 3 month(s) ago

MWIRE LABS