NE-BERT is Northeast India's first domain-specific multilingual foundation model. Built on the ModernBERT architecture and trained on 8.3 million sentences, it supports nine regional languages: Assamese, Khasi, Garo, Manipuri (Meitei), Mizo, Nyishi, Nagamese, Kokborok, and Pnar. It achieves state-of-the-art performance on regional benchmarks and offers 1.6x faster inference, helping bridge the digital divide for low-resource languages.
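Because the ModernBERT backbone supports an 8,192-token context window (see the technical details below), longer documents still need to be split before encoding. A minimal sketch of sliding-window chunking with overlapping strides; the toy window sizes and the `chunk_ids` helper are illustrative, not NE-BERT's actual preprocessing:

```python
def chunk_ids(token_ids, max_len=8192, stride=512):
    """Split a token-id sequence into overlapping windows that each fit
    the model's context; consecutive windows share `stride` tokens so
    no span of text loses its surrounding context at a boundary."""
    if len(token_ids) <= max_len:
        return [token_ids]
    chunks, start = [], 0
    step = max_len - stride
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += step
    return chunks

# Toy numbers so the behaviour is visible: 20 ids, window 8, overlap 2
# produces three windows: 0-7, 6-13, 12-19.
print(chunk_ids(list(range(20)), max_len=8, stride=2))
```

Each chunk can then be encoded independently and the per-chunk outputs pooled, a common pattern for documents that exceed even a long context window.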
NE-BERT is a state-of-the-art transformer model designed specifically for the complex, low-resource linguistic landscape of Northeast India. While generic multilingual models (like mBERT or IndicBERT) often underperform on Austroasiatic and Tibeto-Burman languages due to data scarcity, NE-BERT treats these languages as first-class citizens.

Supported Languages: The model is trained on a curated corpus of ~8.3 million sentences covering:

1. Assamese [asm] (Indo-Aryan)
2. Manipuri/Meitei [mni] (Tibeto-Burman)
3. Khasi [kha] (Austroasiatic)
4. Mizo [lus] (Tibeto-Burman)
5. Garo [grt] (Tibeto-Burman)
6. Nyishi [njz] (Tibeto-Burman)
7. Nagamese Creole [nag] (Indo-Aryan)
8. Kokborok [trp] (Tibeto-Burman)
9. Pnar [pbv] (Austroasiatic)

Key Technical Features:

* Architecture: ModernBERT-Base (Pre-Norm, Rotary Embeddings, Flash Attention 2)
* Context Window: Supports up to 8,192 tokens
* Efficiency: 1.6x more efficient tokenization than mBERT
* Performance: Low perplexity on Austroasiatic and Tibeto-Burman languages
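The "1.6x more efficient tokenization" claim is typically measured as fertility: the average number of subword tokens a tokenizer emits per whitespace word, where lower is better. A minimal sketch of that metric, with toy stand-in tokenizers rather than the real NE-BERT and mBERT vocabularies:

```python
def fertility(tokenize, sentences):
    """Average subword tokens produced per whitespace word (lower is better)."""
    tokens = sum(len(tokenize(s)) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return tokens / words

# Toy stand-ins: a tokenizer whose vocabulary covers the language emits
# roughly one token per word; a poorly adapted one fragments every word.
adapted = lambda s: s.split()
fragmenting = lambda s: [p for w in s.split() for p in (w[:2], w[2:]) if p]

corpus = ["NE BERT supports nine regional languages"]
f_adapted = fertility(adapted, corpus)
f_fragmenting = fertility(fragmenting, corpus)
print(f_fragmenting / f_adapted)  # fragmentation overhead of the bad tokenizer
```

Run against a shared evaluation corpus, the ratio of the two fertilities gives an efficiency multiplier of the kind quoted above; higher fertility also means fewer real words fit in a fixed context window.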
License: Attribution 4.0 International (CC BY 4.0)
Publisher: MWirelabs
Frameworks: Transformers, PyTorch
Access: Open
Domain: Arts, Culture and Tourism
22/11/25 13:08:10