Northeast India's first AI language model. Kren-M is a 2.6B-parameter bilingual model for Khasi and English, built on Gemma-2-2B. It features the Kren-NE custom tokenizer, which covers 7 Northeast Indian languages (Khasi, Garo, Mizo, Assamese, Manipuri, Nagamese, Nyishi) and gives a 35.7% tokenization efficiency gain over the base model. Trained on 5.43M Khasi sentences. Capabilities: bidirectional translation, natural conversation, cultural context. Designed for language preservation across Northeast India.
Kren-M™ is Northeast India's first production-ready AI language model, designed initially for Khasi, with foundational support for the broader Northeast Indian linguistic landscape. This 2.6B-parameter bilingual model, built on Google's Gemma-2-2B, represents a breakthrough in AI accessibility for low-resource Indian languages, particularly those from the historically underserved Northeast region.

Developed by MWire Labs in Shillong, Meghalaya, Kren-M addresses a critical gap: Northeast Indian languages, despite having millions of speakers, have had virtually no representation in modern NLP systems. Khasi, the primary focus language, is an Austroasiatic language spoken by approximately 1.4 million people in Meghalaya.

KREN-NE TOKENIZER - MULTI-LANGUAGE FOUNDATION:
The model's core innovation is the Kren-NE custom tokenizer, which extends Gemma's SentencePiece vocabulary with 2,135 tokens covering seven Northeast Indian languages:
Khasi (kha_Latn)
Garo (grt_Latn)
Mizo (lus_Latn)
Assamese (asm_Beng)
Manipuri / Meitei (mni_Beng)
Nagamese (nag_Latn)
Nyishi (njz_Latn)
This multi-language tokenizer yields a 35.7% improvement in tokenization efficiency over the base model and establishes a foundation for future Northeast Indian language models, making Kren-M not just a Khasi model but a stepping stone for regional AI development.

KEY FEATURES:
2.6B parameters with extended vocabulary (258,135 tokens)
Kren-NE multi-language tokenizer covering 7 NE languages
35.7% tokenization efficiency improvement over the base model
Khasi ↔ English translation capability (instruction-based)
Natural conversational abilities in both languages
Cultural context awareness
2048-token context window
BFloat16 precision (~6GB inference memory)

TRAINING METHODOLOGY:
Phase 1: Kren-NE Tokenizer Development: Extended Gemma's tokenizer with 2,135 subwords selected by frequency analysis across Northeast Indian language corpora, with primary focus on Khasi and Garo.
Phase 2: Continued Pre-Training: Trained on 5.43M cleaned Khasi sentences (~521M tokens) for 2 epochs over 4 days on NVIDIA A40 hardware. Reduced perplexity from the baseline to 19.9.
Phase 3: Supervised Fine-Tuning: Fine-tuned with LoRA adaptation on 42,977 instruction pairs: 20K translation examples, 15K English chat examples, and 7,977 native Khasi conversational examples.

APPLICATIONS:
Language education and preservation initiatives across Northeast India
Government digital services in Meghalaya
Translation systems for official documents
Conversational AI for civic engagement
Research on endangered language technologies
A foundation for future Northeast Indian language models
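The figures quoted in the description above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the Kren-M release: the function names are hypothetical, and the definition of "tokenization efficiency" as the relative reduction in token count for the same text is an assumption, since the description does not say how the 35.7% figure was measured. The token counts passed to `token_reduction` in the last line are made-up sample values, not measurements.

```python
# Figures taken from the model description.
EXTENDED_VOCAB = 258_135   # vocabulary size after the Kren-NE extension
ADDED_TOKENS = 2_135       # tokens added by Kren-NE
PARAMS = 2.6e9             # parameter count
BYTES_PER_BF16 = 2         # a bfloat16 value occupies 2 bytes
CPT_SENTENCES = 5.43e6     # Khasi sentences used for continued pre-training
CPT_TOKENS = 521e6         # approximate token count of that corpus
SFT_MIX = {                # supervised fine-tuning examples by type
    "translation": 20_000,
    "english_chat": 15_000,
    "khasi_conversation": 7_977,
}


def base_vocab_size(extended: int, added: int) -> int:
    """Vocabulary size before the extension."""
    return extended - added


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes).

    Activations and the KV cache come on top of this, which is why the
    quoted inference figure (~6GB) is higher than the weights-only number.
    """
    return params * bytes_per_param / 1e9


def token_reduction(base_tokens: int, new_tokens: int) -> float:
    """ASSUMED efficiency metric: relative drop in token count on the same text."""
    if base_tokens <= 0:
        raise ValueError("base_tokens must be positive")
    return (base_tokens - new_tokens) / base_tokens


if __name__ == "__main__":
    print("base vocabulary:", base_vocab_size(EXTENDED_VOCAB, ADDED_TOKENS))
    print("SFT pairs:", sum(SFT_MIX.values()))
    print("weights in bf16 (GB):", weight_memory_gb(PARAMS, BYTES_PER_BF16))
    print("avg tokens per sentence:", round(CPT_TOKENS / CPT_SENTENCES))
    # Hypothetical counts: 1000 tokens with the base tokenizer, 643 with Kren-NE.
    print("example reduction:", token_reduction(1000, 643))
```

Under these assumptions the numbers are internally consistent: the extension accounts for the full difference between the base and extended vocabularies, the three fine-tuning subsets sum to the stated 42,977 pairs, and the bfloat16 weights alone come to a little over 5 GB, which fits within the quoted inference footprint.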
Attribution-Non-Commercial 4.0 International (CC BY-NC 4.0)
MWirelabs
Text Generation
PyTorch
Open
Social
19/11/25 11:57:36
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by National e-Governance Division for AIKosh mission.