KhasiBERT

Khasi language model trained on 3.6M sentences using RoBERTa architecture. 110M parameters. Supports NLP tasks for Khasi text processing.

MWire Labs
Badalnyalang

About Model

KhasiBERT is the first transformer-based language model developed specifically for the Khasi language of Meghalaya. This foundational model addresses the computational linguistics needs of Meghalaya's 1.4+ million Khasi speakers and provides a basis for digital language-processing infrastructure in the state.

The model implements the RoBERTa architecture with 12 transformer layers, 768 hidden dimensions, and 12 attention heads, totaling 110,652,416 parameters. KhasiBERT was trained using masked language modeling on a corpus of 3,621,116 Khasi sentences collected and preprocessed to represent diverse linguistic patterns of the language.
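The stated total of 110,652,416 parameters can be sanity-checked with a little arithmetic. The sketch below counts the weights of a RoBERTa-style masked language model using figures given in this description (12 layers, 768 hidden dimensions, 32,000 vocabulary tokens, 514 positions). The feed-forward width of 3,072, the two-row token-type table, the tied output projection, and the absence of a pooler layer are assumptions that are not stated here; they were chosen because together they reproduce the stated total.

```python
# Figures from the model description.
VOCAB_SIZE = 32_000
HIDDEN = 768
LAYERS = 12
MAX_POSITIONS = 514

# Assumptions (not stated in the description).
FFN = 3_072      # feed-forward width, 4 x hidden
TYPE_VOCAB = 2   # rows in the token-type embedding table
# Also assumed: the MLM output projection shares the word-embedding matrix,
# and no pooler layer is counted.


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(width: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * width


# Word, position, and token-type tables, followed by one LayerNorm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)

per_layer = (
    4 * linear(HIDDEN, HIDDEN)   # query, key, value, attention output
    + layer_norm(HIDDEN)
    + linear(HIDDEN, FFN)
    + linear(FFN, HIDDEN)
    + layer_norm(HIDDEN)
)

# MLM head: dense layer, LayerNorm, and a per-token output bias.
mlm_head = linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN) + VOCAB_SIZE

total = embeddings + LAYERS * per_layer + mlm_head
print(f"{total:,}")  # 110,652,416
```

Roughly 22% of the total sits in the word-embedding table (32,000 x 768 = 24,576,000 weights), which is typical for a model of this size.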

A custom byte-level BPE tokenizer with a 32,000-token vocabulary was developed for Khasi to handle its Austroasiatic linguistic characteristics. Training used mixed precision (FP16) with a batch size of 24, a learning rate of 5e-5, and the AdamW optimizer for 150,880 training steps.
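The training figures are consistent with a single pass over the corpus. The check below is a rough sketch that assumes 24 is the effective batch size (one device, no gradient accumulation) and that each of the 3,621,116 sentences forms one training example; neither assumption is stated in this description.

```python
import math

SENTENCES = 3_621_116     # corpus size from the description
BATCH_SIZE = 24           # assumed to be the effective batch size
TRAINING_STEPS = 150_880  # from the description

# The last batch is partial, so round up.
steps_per_epoch = math.ceil(SENTENCES / BATCH_SIZE)
epochs = TRAINING_STEPS / steps_per_epoch

print(steps_per_epoch)  # 150880
print(epochs)           # 1.0
```

If gradient accumulation or several devices were used, the effective batch would be larger and the same step count would correspond to more than one epoch.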

The model supports standard transformer operations, including masked token prediction and contextualized embedding generation, and can be fine-tuned for downstream tasks such as text classification, sentiment analysis, named entity recognition, and question answering in Khasi. Input sequences are limited to 512 tokens, with standard special tokens for beginning-of-sequence, end-of-sequence, padding, unknown tokens, and masking.
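Masked token prediction could be tried with the Transformers fill-mask pipeline, as in the minimal sketch below. The repository name `MWirelabs/khasibert` is an assumed placeholder and should be replaced with the actual model identifier; `fill_template` and `predict_masked` are hypothetical helper names introduced for illustration. The mask token is read from the tokenizer rather than hard-coded, since the description does not spell it out.

```python
def fill_template(template: str, mask_token: str) -> str:
    """Replace the single '{mask}' placeholder with the tokenizer's mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


def predict_masked(template: str, model_id: str = "MWirelabs/khasibert", top_k: int = 5):
    """Return (token, score) pairs for the masked position.

    model_id is an assumed repository name, not confirmed by the description.
    """
    # Imported lazily so fill_template works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    text = fill_template(template, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=top_k)]
```

Calling `predict_masked` with a Khasi sentence that contains one `{mask}` placeholder downloads the weights on first use and returns the `top_k` candidate tokens with their scores.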

Technical specifications include GELU activation functions, a dropout probability of 0.1, layer normalization with epsilon 1e-12, and positional embeddings for up to 514 positions. The model weights are distributed in Safetensors format and are compatible with the Transformers library ecosystem for integration with existing NLP workflows.
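The gap between the 512-token input limit and the 514 positional embeddings is typical of RoBERTa-style models. In the Hugging Face RoBERTa implementation, real tokens are numbered starting just after the padding index, and padded slots keep the padding index itself. The sketch below reproduces that numbering in plain Python, assuming KhasiBERT keeps this convention with a padding id of 1 (the RoBERTa default; the description does not give the padding id).

```python
from itertools import accumulate

PAD_ID = 1  # assumed padding token id (the RoBERTa default)


def position_ids(input_ids: list[int], pad_id: int = PAD_ID) -> list[int]:
    """Number non-padding tokens from pad_id + 1; padding keeps pad_id."""
    mask = [int(tok != pad_id) for tok in input_ids]
    running = accumulate(mask)  # cumulative count of real tokens
    return [count * m + pad_id for count, m in zip(running, mask)]


# A full-length sequence of 512 real tokens (the id 5 is an arbitrary non-pad value).
full = position_ids([5] * 512)
print(full[0], full[-1])  # 2 513
print(max(full) + 1)      # 514

# A short sequence padded on the right.
print(position_ids([0, 7, 8, 2, 1, 1]))  # [2, 3, 4, 5, 1, 1]
```

Under these assumptions the highest position index for a 512-token input is 513, so the embedding table needs 514 rows even though only 512 of them are used by real tokens.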

Metadata

License

Creative Commons Attribution Non Commercial 4.0

Hosted By

MWirelabs

Model Type

Text Generation

Model Format

PyTorch

Visibility

Open

Source organisation

MWire Labs

Sector

Education and Skill Development

Updated Date & Time

04/09/25 02:55:59

Created By

Badal Nyalang

Related Models

More Models from MWire Labs

Northeast Language Identification

NE-LID is a fast and accurate language identification model for Northeast Indian languages using character level features. It is designed for low resource and script diverse text and achieves high accuracy on short sentences.

fastText

MWire Labs

Multilingual

low-resource

northeast-india

fasttext

language identification

Updated 1 month(s) ago

NortheastNER

NortheastNER is a token classification model built on XLM-RoBERTa and fine-tuned on ~25k sentences from gazetteers, news, and cultural texts across Northeast India. It detects region-specific entities, including places, tribes, festivals, tourist sites, flora, fauna, and experimental local names, and is suited to low-resource NER, regional search, cultural analytics, and knowledge graph applications.

northeast-india

NER

Northeast India

Conservation

Meghalaya

XLM-RoBERTa

low-resource

Token Classification

Updated 2 month(s) ago

Kren-M

Northeast India's first AI language model. Kren-M is a 2.6B parameter bilingual model for Khasi-English, built on Gemma-2-2B. It features the Kren-NE custom tokenizer, which covers 7 NE languages (Khasi, Garo, Mizo, Assamese, Manipuri, Nagamese, Nyishi) with a 35.7% efficiency gain. Trained on 5.43M Khasi sentences. Capabilities: bidirectional translation, natural conversation, cultural context. Designed for language preservation across Northeast India.

Garo

khasi

northeast-india

low-resource

continued-pretraining

Instruction-Tuning

bilingual

Indian Languages

Northeast India

Kren-M

Northeast India Languages

Foundational model

Tokenizer

Updated 2 month(s) ago

NE-BERT

NE-BERT is Northeast India's first domain-specific multilingual foundation model. Built on the ModernBERT architecture and trained on 8.3 million sentences, it supports 9 regional languages: Assamese, Khasi, Garo, Manipuri (Meitei), Mizo, Nyishi, Nagamese, Kokborok, and Pnar. It achieves state-of-the-art performance on regional benchmarks and offers 1.6x faster inference, helping to bridge the digital divide for low-resource languages.

kokborok

Pnar

modernbert

Masked Language Modeling

northeast-india

mwirelabs

token-efficiency

Assamese

Garo

Nyishi

low-resource-NLP

Meitei

Nagamese

khasi

A'chik

Mizo

northeast bert

Updated 2 month(s) ago

KhasiBERT

Khasi language model trained on 3.6M sentences using RoBERTa architecture. 110M parameters. Supports NLP tasks for Khasi text processing.

region:us

safetensors

roberta

Fill-Mask

khasi

Bert

masked-lm

foundational-model

low-resource

Indian Language

austroasiatic

kha

autotrain_compatible

endpoints_compatible

Meghalaya

digital-india

Updated 5 month(s) ago

Khasi English Semantic Search Model

Khasi-English semantic search model, trained on 66,794 pairs with 0.69-0.74 similarity. ~90MB, supports Meghalaya tourism/culture. By MWirelabs

Meghalaya

khasi-culture

text-embeddings-inference

autotrain_compatible

license:cc0-1.0

kha

Sentence Similarity

cross-lingual

semantic search

khasi

safetensors

sentence-transformers

Updated 5 month(s) ago
