
KhasiBERT

Khasi language model trained on 3.6M sentences using RoBERTa architecture. 110M parameters. Supports NLP tasks for Khasi text processing.

About Model

KhasiBERT is the first transformer-based language model developed specifically for the Khasi language of Meghalaya. This foundational model addresses the computational-linguistics needs of Meghalaya's 1.4+ million Khasi speakers and lays the groundwork for digital language-processing infrastructure in the state.

The model implements the RoBERTa architecture with 12 transformer layers, 768 hidden dimensions, and 12 attention heads, totaling 110,652,416 parameters. KhasiBERT was trained using masked language modeling on a corpus of 3,621,116 Khasi sentences collected and preprocessed to represent diverse linguistic patterns of the language.
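The reported parameter count can be sanity-checked with a back-of-the-envelope tally over these dimensions. The sketch below assumes otherwise-standard roberta-base internals (a 3,072-dimensional feed-forward layer, an MLM decoder tied to the word embeddings) plus a BERT-style token-type vocabulary of 2; none of these are stated in the card, but together they reproduce the 110,652,416 figure exactly.

```python
# Back-of-the-envelope parameter tally for a RoBERTa-base masked-LM with a
# 32,000-token vocabulary. The FFN size and type_vocab value are assumptions
# (standard roberta-base / BERT conventions), not taken from the model card.

def roberta_mlm_params(vocab=32_000, hidden=768, layers=12,
                       max_pos=514, ffn=3_072, type_vocab=2):
    embeddings = (
        vocab * hidden            # word embeddings
        + max_pos * hidden        # position embeddings (514 positions)
        + type_vocab * hidden     # token-type embeddings
        + 2 * hidden              # embedding LayerNorm (gamma, beta)
    )
    per_layer = (
        4 * (hidden * hidden + hidden)   # Q, K, V, output projections
        + 2 * hidden                     # attention LayerNorm
        + hidden * ffn + ffn             # FFN up-projection
        + ffn * hidden + hidden          # FFN down-projection
        + 2 * hidden                     # output LayerNorm
    )
    lm_head = (
        hidden * hidden + hidden  # dense transform
        + 2 * hidden              # LayerNorm
        + vocab                   # decoder bias (weights tied to embeddings)
    )
    return embeddings + layers * per_layer + lm_head

print(roberta_mlm_params())  # 110652416
```

The exact match suggests the checkpoint ships tied embedding/decoder weights and a two-entry token-type table.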

A custom Byte-Level BPE tokenizer with 32,000 vocabulary tokens was developed specifically for the Khasi language to handle its Austroasiatic linguistic characteristics effectively. Training utilized mixed-precision (FP16) optimization with a batch size of 24, learning rate of 5e-5, and AdamW optimizer over 150,880 training steps.
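One practical advantage of a byte-level tokenizer for Khasi is that it can never emit an unknown token: every input is first mapped onto a fixed 256-symbol byte alphabet, so accented Khasi characters (such as the diaeresis in "ï") always remain representable. The sketch below reproduces the standard GPT-2/RoBERTa byte-to-unicode table that Byte-Level BPE tokenizers are built on; the mapping itself is generic, not specific to KhasiBERT's vocabulary.

```python
# The standard byte-to-unicode table used by byte-level BPE tokenizers.
# Every one of the 256 byte values gets a visible stand-in character, so
# no input text can fall outside the tokenizer's alphabet.

def bytes_to_unicode():
    # Printable bytes keep their own code point; the remaining bytes are
    # shifted into the 256+ range so each has a distinct visible symbol.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

table = bytes_to_unicode()

# A Khasi word with a diacritic is representable byte-for-byte:
word = "ïing"  # Khasi for "house"
encoded = "".join(table[b] for b in word.encode("utf-8"))
print(len(table), encoded)
```

BPE merges are then learned over these stand-in symbols, which is what lets a 32,000-token vocabulary cover arbitrary Khasi text without an `<unk>` fallback.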

The model supports standard transformer operations, including masked-token prediction and contextualized-embedding generation, and can be fine-tuned for downstream tasks such as text classification, sentiment analysis, named entity recognition, and question answering in Khasi. Input sequences are limited to 512 tokens, with standard special tokens for beginning-of-sequence, end-of-sequence, padding, unknown tokens, and masking.
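To illustrate the 512-token limit and the special-token framing, here is a minimal sketch using the conventional RoBERTa ids (`<s>`=0, `<pad>`=1, `</s>`=2). These ids are an assumption for illustration; KhasiBERT's actual values are defined by its own tokenizer configuration.

```python
# Minimal illustration of RoBERTa-style input framing: a <s> ... </s>
# wrapper, truncation to the 512-position window, and <pad> filling for
# batching. Token ids here are the conventional RoBERTa defaults, assumed
# for illustration only.

BOS, PAD, EOS = 0, 1, 2
MAX_LEN = 512

def frame(token_ids, pad_to=None):
    # Reserve two slots for <s> and </s>, truncating the content if needed.
    ids = [BOS] + token_ids[: MAX_LEN - 2] + [EOS]
    if pad_to:
        ids += [PAD] * (pad_to - len(ids))
    return ids

# A 600-token sequence is truncated to fit the 512-position window:
long_input = frame(list(range(600)))
print(len(long_input))  # 512

# A short sequence is padded out for batching:
short_input = frame([7, 8, 9], pad_to=10)
print(short_input)  # [0, 7, 8, 9, 2, 1, 1, 1, 1, 1]
```

Fine-tuning for the downstream tasks listed above reuses exactly this framing; only the head on top of the encoder changes.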

Technical specifications include GELU activation functions, 0.1 dropout probability, layer normalization with epsilon 1e-12, and positional embeddings up to 514 positions. The model weights are distributed in Safetensors format and are compatible with the Transformers library ecosystem for integration with existing NLP workflows.
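Collected into the Transformers `config.json` convention, these specifications would look roughly as follows. This is a sketch: `intermediate_size` is the standard roberta-base value rather than a figure from this card, and the attention dropout is assumed equal to the stated 0.1.

```json
{
  "model_type": "roberta",
  "num_hidden_layers": 12,
  "hidden_size": 768,
  "num_attention_heads": 12,
  "intermediate_size": 3072,
  "vocab_size": 32000,
  "max_position_embeddings": 514,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1,
  "layer_norm_eps": 1e-12
}
```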


Metadata

  • License: Creative Commons Attribution Non Commercial 4.0
  • Organization: MWirelabs
  • Task: Text Generation
  • Framework: PyTorch
  • Access: Open
  • Publisher: MWire Labs
  • Sector: Education and Skill Development
  • Published: 04/09/25 02:55:59
  • Contributor: Badal Nyalang
  • Upvotes: 0

Activity Overview

  • Downloads 1
  • Redirect 12
  • Views 388
  • File Size 0

Tags

  • Meghalaya
  • digital-india
  • safetensors
  • roberta
  • Fill-Mask
  • khasi
  • Bert
  • masked-lm
  • foundational-model
  • low-resource
  • Indian Language
  • austroasiatic
  • kha
  • autotrain_compatible
  • endpoints_compatible
  • region:us

License Control

Creative Commons Attribution Non Commercial 4.0

Related Models

Kren-M
Northeast India's first AI language model. Kren-M is a 2.6B-parameter bilingual model for Khasi-English, built on Gemma-2-2B. It features the Kren-NE custom tokenizer covering 7 NE languages (Khasi, Garo, Mizo, Assamese, Manipuri, Nagamese, Nyishi) with a 35.7% efficiency gain, and was trained on 5.43M Khasi sentences. Capabilities: bidirectional translation, natural conversation, cultural context. Designed for language preservation across Northeast India.
Tags: Tokenizer, low-resource, continued-pretraining, Instruction-Tuning, bilingual, Garo, Indian Languages, Northeast India, Kren-M, Northeast India Languages, Foundational model, khasi, northeast-india
  • Upvoters 0
  • Downloads 10
  • File Size 0
  • Views 321
Updated 2 months ago

MWIRE LABS

Kren v1: Khasi Generative Language Model
Kren v1 is the first Khasi generative language model, trained on 1M lines, pioneering encoder-to-decoder adaptation for low-resource AI.
Tags: khasi-culture, Indigenous Language, Northeast India, Encoder-to-Decoder, AI for Culture, Natural Language Processing, Meghalaya, MWire Labs, Low-Resource NLP, khasi
  • Upvoters 0
  • Downloads 0
  • File Size 390.67 MB
  • Views 8
Updated 4 months ago


More Models from MWire Labs

Northeast Language Identification
NE-LID is a fast and accurate language-identification model for Northeast Indian languages using character-level features. It is designed for low-resource and script-diverse text and achieves high accuracy on short sentences.
Tags: fastText, MWire Labs, Multilingual, low-resource, northeast-india, fasttext, language identification
  • Upvoters 0
  • Downloads 4
  • File Size 0
  • Views 157
Updated 1 month ago


NortheastNER
NortheastNER is a token-classification model built on XLM-RoBERTa and fine-tuned on ~25k sentences from gazetteers, news, and cultural texts across Northeast India. It detects region-specific entities such as places, tribes, festivals, tourist sites, flora, fauna, and experimental local names, making it well suited to low-resource NER, regional search, cultural analytics, and knowledge-graph applications.
Tags: northeast-india, NER, Northeast India, Conservation, Meghalaya, XLM-RoBERTa, low-resource, Token Classification
  • Upvoters 0
  • Downloads 6
  • File Size 0
  • Views 101
Updated 2 months ago


Kren-M
Northeast India's first AI language model. Kren-M is a 2.6B-parameter bilingual model for Khasi-English, built on Gemma-2-2B. It features the Kren-NE custom tokenizer covering 7 NE languages (Khasi, Garo, Mizo, Assamese, Manipuri, Nagamese, Nyishi) with a 35.7% efficiency gain, and was trained on 5.43M Khasi sentences. Capabilities: bidirectional translation, natural conversation, cultural context. Designed for language preservation across Northeast India.
Tags: Garo, khasi, northeast-india, low-resource, continued-pretraining, Instruction-Tuning, bilingual, Indian Languages, Northeast India, Kren-M, Northeast India Languages, Foundational model, Tokenizer
  • Upvoters 0
  • Downloads 10
  • File Size 0
  • Views 321
Updated 2 months ago


NE-BERT
NE-BERT is Northeast India's first domain-specific multilingual foundation model. Built on the ModernBERT architecture and trained on 8.3 million sentences, it supports 9 regional languages: Assamese, Khasi, Garo, Manipuri (Meitei), Mizo, Nyishi, Nagamese, Kokborok, and Pnar. It achieves state-of-the-art performance on regional benchmarks and offers 1.6x faster inference, bridging the digital divide for low-resource languages.
Tags: kokborok, Pnar, modernbert, Masked Language Modeling, northeast-india, mwirelabs, token-efficiency, Assamese, Garo, Nyishi, low-resource-NLP, Meitei, Nagamese, khasi, A'chik, Mizo, northeast bert
  • Upvoters 0
  • Downloads 9
  • File Size 0
  • Views 265
Updated 2 months ago


KhasiBERT
Khasi language model trained on 3.6M sentences using RoBERTa architecture. 110M parameters. Supports NLP tasks for Khasi text processing.
Tags: region:us, safetensors, roberta, Fill-Mask, khasi, Bert, masked-lm, foundational-model, low-resource, Indian Language, austroasiatic, kha, autotrain_compatible, endpoints_compatible, Meghalaya, digital-india
  • Upvoters 1
  • Downloads 12
  • File Size 0
  • Views 389
Updated 5 months ago


Khasi English Semantic Search Model
A Khasi-English semantic search model trained on 66,794 sentence pairs with 0.69-0.74 similarity scores. At ~90 MB, it supports Meghalaya tourism and culture content. By MWirelabs.
Tags: Meghalaya, khasi-culture, text-embeddings-inference, autotrain_compatible, license:cc0-1.0, kha, en, Sentence Similarity, cross-lingual, semantic search, khasi, safetensors, sentence-transformers
  • Upvoters 0
  • Downloads 14
  • File Size 0
  • Views 387
Updated 5 months ago
