Kren-M

Northeast India's first AI language model. Kren-M is a 2.6B parameter bilingual Khasi-English model built on Gemma-2-2B. It features the Kren-NE custom tokenizer, which covers 7 NE languages (Khasi, Garo, Mizo, Assamese, Manipuri, Nagamese, Nyishi) with a 35.7% efficiency gain. Trained on 5.43M Khasi sentences. Capabilities: bidirectional translation, natural conversation, cultural context. Designed for language preservation across Northeast India.

About Model

Kren-M™ is Northeast India's first production-ready AI language model, designed initially for Khasi with foundational support for the broader Northeast Indian linguistic landscape. This 2.6B parameter bilingual model, built on Google's Gemma-2-2B, represents a breakthrough in AI accessibility for low-resource Indian languages, particularly those from the historically underserved Northeast region.

Developed by MWire Labs in Shillong, Meghalaya, Kren-M addresses a critical gap: Northeast Indian languages, despite having millions of speakers, have had virtually no representation in modern NLP systems. Khasi, the primary focus language, is an Austroasiatic language spoken by approximately 1.4 million people in Meghalaya.

KREN-NE TOKENIZER - MULTI-LANGUAGE FOUNDATION:

The model's core innovation is the Kren-NE custom tokenizer, which extends Gemma's SentencePiece vocabulary with 2,135 tokens covering seven Northeast Indian languages:

  • Khasi (kha_Latn)
  • Garo (grt_Latn)
  • Mizo (lus_Latn)
  • Assamese (asm_Beng)
  • Manipuri / Meitei (mni_Beng)
  • Nagamese (nag_Latn)
  • Nyishi (njz_Latn)

This multi-language tokenizer delivers a 35.7% tokenization efficiency improvement over the base model and establishes a foundation for future Northeast Indian language models, making Kren-M not just a Khasi model but a stepping stone for regional AI development.

KEY FEATURES:

  • 2.6B parameters with extended vocabulary (258,135 tokens)
  • Kren-NE multi-language tokenizer covering 7 NE languages
  • 35.7% tokenization efficiency improvement over base model
  • Khasi ↔ English translation capability (instruction-based)
  • Natural conversational abilities in both languages
  • Cultural context awareness
  • 2048 token context window
  • BFloat16 precision (~6GB inference memory)

TRAINING METHODOLOGY:

  • Phase 1: Kren-NE Tokenizer Development: Extended Gemma's tokenizer with 2,135 subwords selected by frequency analysis across Northeast Indian language corpora, with primary focus on Khasi and Garo.
  • Phase 2: Continued Pre-Training: Trained on 5.43M cleaned Khasi sentences (~521M tokens) for 2 epochs over 4 days on an NVIDIA A40. Reduced perplexity from the baseline to 19.9.
  • Phase 3: Supervised Fine-Tuning: Fine-tuned with LoRA adaptation on 42,977 instruction pairs: 20K translation examples, 15K English chat examples, and 7,977 native Khasi conversational examples.

APPLICATIONS:

  • Language education and preservation initiatives across Northeast India
  • Government digital services in Meghalaya
  • Translation systems for official documents
  • Conversational AI for civic engagement
  • Research on endangered language technologies
  • A foundation for future Northeast Indian language models
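The figures quoted in the description above can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not taken from MWire Labs' code. It assumes the base Gemma vocabulary has 256,000 entries (consistent with 258,135 minus the 2,135 added tokens), and it assumes "tokenization efficiency" means the relative reduction in token count for the same text; the description does not define the metric, so `token_reduction` is a hypothetical helper.

```python
"""Cross-check the figures quoted in the Kren-M description (illustrative sketch)."""

# Assumption: base Gemma vocabulary size, inferred from 258,135 - 2,135.
BASE_VOCAB = 256_000
ADDED_TOKENS = 2_135          # Kren-NE tokens added in Phase 1
PARAMS = 2.6e9                # parameter count quoted in the description
BYTES_PER_BF16 = 2            # a bfloat16 value occupies 2 bytes
CPT_SENTENCES = 5_430_000     # Phase 2 corpus size in sentences
CPT_TOKENS = 521_000_000      # Phase 2 corpus size in tokens (approximate)
SFT_MIX = {                   # Phase 3 instruction pairs by type
    "translation": 20_000,
    "english_chat": 15_000,
    "khasi_chat": 7_977,
}


def token_reduction(base_tokens: int, new_tokens: int) -> float:
    """Fraction of tokens saved relative to the base tokenizer.

    Assumed definition of 'efficiency improvement'; the source does not
    state how the 35.7% figure was measured.
    """
    if base_tokens <= 0:
        raise ValueError("base_tokens must be positive")
    return (base_tokens - new_tokens) / base_tokens


extended_vocab = BASE_VOCAB + ADDED_TOKENS           # matches the quoted 258,135
weights_gb = PARAMS * BYTES_PER_BF16 / 1e9           # weights only, in GB
sft_total = sum(SFT_MIX.values())                    # matches the quoted 42,977
avg_tokens_per_sentence = CPT_TOKENS / CPT_SENTENCES

print(f"extended vocabulary: {extended_vocab:,}")
print(f"bf16 weights: {weights_gb:.1f} GB")
print(f"SFT pairs: {sft_total:,}")
print(f"average tokens per sentence: {avg_tokens_per_sentence:.1f}")
```

Under these assumptions the weights alone account for about 5.2 GB, so the quoted ~6GB inference figure leaves some headroom for activations and the key-value cache. With the assumed metric, a passage that took 1,000 tokens with the base tokenizer and 643 with Kren-NE would give `token_reduction(1000, 643)` of 0.357.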

Metadata

Attribution-Non-Commercial 4.0 International (CC BY-NC 4.0)

MWirelabs

Text Generation

PyTorch

Open

MWire Labs

Social

19/11/25 11:57:36

Badal Nyalang

0

Activity Overview

  • Downloads 0
  • Redirect 11
  • Views 368
  • File Size 0

Tags

  • khasi
  • northeast-india
  • low-resource
  • continued-pretraining
  • Instruction-Tuning
  • bilingual
  • Garo
  • Indian Languages
  • Northeast India
  • Kren-M
  • Northeast India Languages
  • Foundational model
  • Tokenizer

License Control

Attribution-Non-Commercial 4.0 International (CC BY-NC 4.0)

Related Models

KhasiBERT
A 110M-parameter Khasi language model built on the RoBERTa architecture and trained on 3.6M sentences. Supports NLP tasks for Khasi text processing.
Meghalaya
digital-india
safetensors
roberta
Fill-Mask
khasi
Bert
masked-lm
foundational-model
low-resource
Indian Language
austroasiatic
kha
autotrain_compatible
endpoints_compatible
region:us
  • Upvoters 1
  • Downloads 16
  • File Size 0
  • Views 468
Updated 5 month(s) ago

MWIRE LABS

More Models from MWire Labs

Northeast Language Identification
NE-LID is a fast and accurate language identification model for Northeast Indian languages that uses character-level features. It is designed for low-resource, script-diverse text and achieves high accuracy on short sentences.
fastText
MWire Labs
Multilingual
low-resource
northeast-india
fasttext
language identification
  • Upvoters 0
  • Downloads 6
  • File Size 0
  • Views 229
Updated 1 month(s) ago

MWIRE LABS

NortheastNER
NortheastNER is a token classification model built on XLM-RoBERTa and fine-tuned on ~25k sentences from gazetteers, news, and cultural texts across Northeast India. It detects region-specific entities: places, tribes, festivals, tourist sites, flora, fauna, and (experimentally) local names. It is suited to low-resource NER, regional search, cultural analytics, and knowledge graph applications.
northeast-india
NER
Northeast India
Conservation
Meghalaya
XLM-RoBERTa
low-resource
Token Classification
  • Upvoters 0
  • Downloads 9
  • File Size 0
  • Views 116
Updated 3 month(s) ago

MWIRE LABS

Kren-M
Northeast India's first AI language model. Kren-M is a 2.6B parameter bilingual Khasi-English model built on Gemma-2-2B. It features the Kren-NE custom tokenizer, which covers 7 NE languages (Khasi, Garo, Mizo, Assamese, Manipuri, Nagamese, Nyishi) with a 35.7% efficiency gain. Trained on 5.43M Khasi sentences. Capabilities: bidirectional translation, natural conversation, cultural context. Designed for language preservation across Northeast India.
Garo
khasi
northeast-india
low-resource
continued-pretraining
Instruction-Tuning
bilingual
Indian Languages
Northeast India
Kren-M
Northeast India Languages
Foundational model
Tokenizer
  • Upvoters 0
  • Downloads 11
  • File Size 0
  • Views 369
Updated 3 month(s) ago

MWIRE LABS

NE-BERT
NE-BERT is Northeast India's first domain-specific multilingual foundation model. Built on the ModernBERT architecture and trained on 8.3 million sentences, it supports 9 regional languages: Assamese, Khasi, Garo, Manipuri (Meitei), Mizo, Nyishi, Nagamese, Kokborok, and Pnar. It achieves state-of-the-art performance on regional benchmarks and offers 1.6x faster inference, helping bridge the digital divide for low-resource languages.
kokborok
Pnar
modernbert
Masked Language Modeling
northeast-india
mwirelabs
token-efficiency
Assamese
Garo
Nyishi
low-resource-NLP
Meitei
Nagamese
khasi
A'chik
Mizo
northeast bert
  • Upvoters 0
  • Downloads 10
  • File Size 0
  • Views 306
Updated 3 month(s) ago

MWIRE LABS

Khasi English Semantic Search Model
Khasi-English semantic search model trained on 66,794 pairs, with reported similarity scores of 0.69-0.74. About 90MB in size; supports Meghalaya tourism and culture content. By MWirelabs.
Meghalaya
khasi-culture
text-embeddings-inference
autotrain_compatible
license:cc0-1.0
kha
en
Sentence Similarity
cross-lingual
semantic search
khasi
safetensors
sentence-transformers
  • Upvoters 0
  • Downloads 15
  • File Size 0
  • Views 436
Updated 5 month(s) ago

MWIRE LABS