ORGANISATION

Kren-M

Northeast India's first AI language model. Kren-M is a 2.6B parameter bilingual model for Khasi-English, built on Gemma-2-2B. It features the Kren-NE custom tokenizer, which covers 7 NE languages (Khasi, Garo, Mizo, Assamese, Manipuri, Nagamese, Nyishi) with a 35.7% tokenization efficiency gain. Trained on 5.43M Khasi sentences. Capabilities: bidirectional translation, natural conversation, cultural context. Designed for language preservation across Northeast India.

About Model

Kren-M™ is Northeast India's first production-ready AI language model, designed initially for Khasi, with foundational support for the broader Northeast Indian linguistic landscape. This 2.6B parameter bilingual model, built on Google's Gemma-2-2B, represents a breakthrough in AI accessibility for low-resource Indian languages, particularly those from the historically underserved Northeast region.

Developed by MWire Labs in Shillong, Meghalaya, Kren-M addresses a critical gap: Northeast Indian languages, despite representing millions of speakers, have had virtually no representation in modern NLP systems. Khasi, the primary focus language, is an Austroasiatic language spoken by approximately 1.4 million people in Meghalaya.

KREN-NE TOKENIZER - MULTI-LANGUAGE FOUNDATION:

The model's core innovation is the Kren-NE custom tokenizer, which extends Gemma's SentencePiece vocabulary with 2,135 tokens covering seven Northeast Indian languages:

Khasi (kha_Latn)
Garo (grt_Latn)
Mizo (lus_Latn)
Assamese (asm_Beng)
Manipuri / Meitei (mni_Beng)
Nagamese (nag_Latn)
Nyishi (njz_Latn)

This multi-language tokenizer architecture yields a 35.7% tokenization efficiency improvement and establishes a foundation for future Northeast Indian language models, making Kren-M not just a Khasi model but a stepping stone for regional AI development.

KEY FEATURES:

2.6B parameters with extended vocabulary (258,135 tokens)
Kren-NE multi-language tokenizer covering 7 NE languages
35.7% tokenization efficiency improvement over base model
Khasi ↔ English translation capability (instruction-based)
Natural conversational abilities in both languages
Cultural context awareness
2048 token context window
BFloat16 precision (~6GB inference memory)

TRAINING METHODOLOGY:

Phase 1: Kren-NE Tokenizer Development: Extended Gemma's tokenizer with 2,135 subwords based on frequency analysis across Northeast Indian language corpora, with primary focus on Khasi and Garo.
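The Phase 1 description does not spell out the selection procedure beyond "frequency analysis". As a rough illustration only, the toy sketch below picks the most frequent corpus words missing from a base vocabulary, then measures the relative drop in token count with a greedy longest-match tokenizer. Everything in it is an assumption made for illustration: the function names, the placeholder corpus (synthetic strings, not real Khasi), the use of whole words as candidates, and the reading of the 35.7% figure as a token-count reduction. The actual Kren-NE pipeline extends a SentencePiece vocabulary and may differ on every one of these points.

```python
from collections import Counter

def select_new_tokens(corpus, base_vocab, budget):
    """Return up to `budget` most frequent words that the base vocabulary lacks.

    Hypothetical helper: whole words stand in for the subword candidates
    a real tokenizer-extension pipeline would consider.
    """
    counts = Counter(word for line in corpus for word in line.split())
    missing = [word for word, _ in counts.most_common() if word not in base_vocab]
    return missing[:budget]

def greedy_tokenize(text, vocab):
    """Longest-match-first segmentation; unknown spans fall back to single characters."""
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest remaining piece first, shrinking to one character.
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in vocab or j == i + 1:
                    tokens.append(piece)
                    i = j
                    break
    return tokens

def efficiency_gain(base_count, extended_count):
    """Percentage reduction in token count (assumed meaning of 'efficiency')."""
    return 100.0 * (base_count - extended_count) / base_count

# Synthetic placeholder data, not real Khasi text.
corpus = ["kata lumo kata", "lumo kata siri"]
base_vocab = {"ka", "lu"}

new_tokens = select_new_tokens(corpus, base_vocab, budget=2)
extended_vocab = base_vocab | set(new_tokens)

sample = "kata lumo siri"
base_tokens = greedy_tokenize(sample, base_vocab)
extended_tokens = greedy_tokenize(sample, extended_vocab)
gain = efficiency_gain(len(base_tokens), len(extended_tokens))

# Vocabulary size implied by the description: base size + 2,135 added tokens.
extended_vocab_size = 256_000 + 2_135

print(new_tokens)                              # ['kata', 'lumo']
print(len(base_tokens), len(extended_tokens))  # 10 6
print(f"{gain:.1f}% fewer tokens")             # 40.0% fewer tokens
print(extended_vocab_size)                     # 258135
```

On this toy input the two added words cut the token count from 10 to 6. The same ratio, computed over a held-out corpus with the real base and extended tokenizers, is one plausible way a figure like 35.7% could be obtained.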
Phase 2: Continued Pre-Training: Trained on 5.43M cleaned Khasi sentences (~521M tokens) for 2 epochs over 4 days on NVIDIA A40. Reduced perplexity from baseline to 19.9.

Phase 3: Supervised Fine-Tuning: Fine-tuned with LoRA adaptation on 42,977 instruction pairs, comprising 20K translation examples, 15K English chat examples, and 7,977 native Khasi conversational examples.

APPLICATIONS:

Language education and preservation initiatives across Northeast India
Government digital services in Meghalaya
Translation systems for official documents
Conversational AI for civic engagement
Research on endangered language technologies
A foundation for future Northeast Indian language models
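The training and memory figures quoted in the description can be cross-checked with some quick arithmetic. The sketch below is only a consistency check under stated assumptions: that the 4 days are continuous wall-clock time, that each epoch covers the full ~521M tokens, that perplexity is the exponential of the mean per-token negative log-likelihood in nats, and that the weight footprint is 2 bytes per parameter in BFloat16. All variable names are illustrative.

```python
import math

# Figures quoted in the model description.
PARAMS = 2.6e9             # parameter count
BYTES_PER_BF16 = 2         # BFloat16 stores each weight in 2 bytes
PRETRAIN_TOKENS = 521e6    # approximate tokens in the cleaned Khasi corpus
EPOCHS = 2
TRAIN_DAYS = 4
PERPLEXITY = 19.9
SFT_MIX = {
    "translation": 20_000,
    "english_chat": 15_000,
    "khasi_conversation": 7_977,
}

# Weights alone in BFloat16; the gap to the quoted ~6GB would be
# activations, KV cache and runtime overhead (assumption).
weights_gb = PARAMS * BYTES_PER_BF16 / 1e9

# Average throughput if training ran continuously for the full 4 days (assumption).
tokens_seen = PRETRAIN_TOKENS * EPOCHS
tokens_per_second = tokens_seen / (TRAIN_DAYS * 24 * 3600)

# Perplexity = exp(mean NLL), so the mean loss is its natural log.
mean_nll = math.log(PERPLEXITY)

# The three fine-tuning subsets should add up to the stated total.
sft_total = sum(SFT_MIX.values())
sft_shares = {name: 100.0 * n / sft_total for name, n in SFT_MIX.items()}

print(f"weights: {weights_gb:.1f} GB")
print(f"throughput: {tokens_per_second:.0f} tokens/s")
print(f"mean NLL: {mean_nll:.2f} nats/token")
print(f"SFT pairs: {sft_total}")
for name, share in sft_shares.items():
    print(f"  {name}: {share:.1f}%")
```

Under these assumptions the weights take about 5.2 GB, which fits the quoted ~6GB inference footprint, and the three fine-tuning subsets sum exactly to the stated 42,977 pairs.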

Metadata

License

Attribution-Non-Commercial 4.0 International (CC BY-NC 4.0)

Hosted By

MWirelabs

Task Type

Text Generation

Model Format

PyTorch

Visibility

Open

Source Organisation

MWire Labs

Sector

Social

Updated Date & Time

19/11/25 11:57:36

Created By

Badal Nyalang

Related Models

More Models from MWire Labs

NE-SpeechEmbed

NE-SpeechEmbed is a multilingual speech-text embedding model by MWire Labs for Northeast Indian languages. The model supports semantic speech search, cross-modal retrieval, and audio-text embeddings across Khasi, Garo, Mizo, Nagamese, Kokborok, Assamese, Wancho, and Chakma.

multilingual

mwire-labs

low-resource-languages

whisper

embeddings

speech-embeddings

speech-text-retrieval

audio-text-retrieval

speech-language-model

speech

XLM-RoBERTa

low-resource

retrieval

northeast-india

Updated 2 day(s) ago

MWIRE LABS

NE-Embed

NE-Embed is a multilingual text embedding model for Northeast Indian languages, enabling semantic search, retrieval, and RAG across 10 languages: Khasi, Garo, Meitei, Bodo, Mizo, Assamese, Nyishi, Kokborok, Pnar, and Nagamese. Fine-tuned from LaBSE on 201,738 parallel pairs.

retrieval

low-resource

rag

sentence-transformers

embeddings

Multilingual

northeast-india

Updated 2 day(s) ago

MWIRE LABS

Mizo OCR - Text Recognition for Mizo Language

OCR model for the Mizo language achieving 90.68% character accuracy on synthetic and curated printed text.

Mizo

Image-to-Text

trocr

northeast-india

low-resource

OCR

Updated 3 month(s) ago

MWIRE LABS

NE-OCR

NE-OCR is a multilingual Optical Character Recognition model developed by MWire Labs to accurately recognize printed text from documents in Northeast Indian languages. The model supports Assamese, Bodo, English, Garo, Hindi, Khasi, Kokborok, Meitei (Bengali script), Meitei (Meitei Mayek script), Mizo, Nagamese, and Nyishi. It is designed to enable reliable digitization of books, newspapers, government records, educational materials, and cultural archives from Northeast India where mainstream OCR

Meitei

Nagamese

northeast-india

khasi

Mizo

Optical Character Recognition

BODO

doctr

vitstr

Multilingual OCR

Northeast India OCR

Printed Text Recognition

OCR

Garo

Kokborok

Nyishi

Updated 3 month(s) ago

MWIRE LABS

Nagamese Speech-to-Text

Automatic Speech Recognition (ASR) model for Nagamese speech, designed to transcribe spoken Nagamese into text for real-world usage.

whisper

low-resource-language

Speech Recognition

Nagamese

Automatic Speech Recognition

ASR

Updated 3 month(s) ago

MWIRE LABS

Garo OCR - Text Recognition for Garo

OCR model for the Garo language achieving 93.13% character accuracy.

northeast-india

florence-2

Garo

OCR

Image-to-Text

Updated 3 month(s) ago

MWIRE LABS

Northeast Language Identification

NE-LID is a fast and accurate language identification model for Northeast Indian languages that uses character-level features. It is designed for low-resource, script-diverse text and achieves high accuracy on short sentences.

language identification

low-resource

northeast-india

MWire Labs

Multilingual

fasttext

fastText

Updated 5 month(s) ago

MWIRE LABS

NortheastNER

NortheastNER is a token classification model built on XLM-RoBERTa and fine-tuned on ~25k sentences from gazetteers, news, and cultural texts across Northeast India. It detects region-specific entities such as places, tribes, festivals, tourist sites, flora, fauna, and experimental local names. It is suited to low-resource NER, regional search, cultural analytics, and knowledge graph applications.

Conservation

Token Classification

Northeast India

Meghalaya

northeast-india

low-resource

XLM-RoBERTa

NER

Updated 7 month(s) ago

MWIRE LABS

Kren-M

Indian Languages

Tokenizer

Foundational model

Northeast India Languages

Kren-M

bilingual

continued-pretraining

Northeast India

khasi

northeast-india

Garo

low-resource

Instruction-Tuning

Updated 7 month(s) ago

MWIRE LABS

NE-BERT

NE-BERT is Northeast India's first domain-specific multilingual foundation model. Built on the ModernBERT architecture and trained on 8.3 million sentences, it supports 9 regional languages: Assamese, Khasi, Garo, Manipuri (Meitei), Mizo, Nyishi, Nagamese, Kokborok, and Pnar. It achieves state-of-the-art performance on regional benchmarks and offers 1.6x faster inference, bridging the digital divide for low-resource languages.

Pnar

kokborok

Mizo

token-efficiency

northeast bert

mwirelabs

modernbert

A'chik

khasi

northeast-india

Nagamese

Meitei

Nyishi

Garo

Assamese

low-resource-NLP

Masked Language Modeling

Updated 7 month(s) ago

MWIRE LABS
