

Khasi English Semantic Search Model

Khasi-English semantic search model, trained on 66,794 pairs with 0.69-0.74 similarity. ~90MB, supports Meghalaya tourism/culture. By MWirelabs

MWire Labs
Badalnyalang

About Model

Developed by MWirelabs, this model is the first production-ready semantic search system for Khasi-English language pairs. It celebrates Northeast India’s linguistic diversity, with a special focus on Meghalaya. It was trained on a curated corpus of 66,794 English-Khasi translation pairs (63,909 Khasi sentences, 65,239 English sentences, 65,241 parallel pairs) and uses the lightweight MiniLM-L6-v2 architecture (~22.7M parameters, ~90MB). The model achieves cosine similarity scores of 0.69-0.74, indicating effective cross-lingual alignment for Khasi, a low-resource Austroasiatic language spoken primarily in Meghalaya.
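The two numeric claims in this description, the ~90MB footprint and the cosine similarity range, can be illustrated with a short worked sketch. It assumes the weights are stored as 32-bit floats (the description does not state the precision), and it uses made-up placeholder vectors rather than real model embeddings. The function names are illustrative and not part of any released code.

```python
import math
from typing import Sequence


def estimated_size_mb(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Rough weight size in megabytes (10**6 bytes).

    Assumption: float32 storage (4 bytes per parameter); the model
    description does not state the precision.
    """
    return num_parameters * bytes_per_parameter / 1_000_000


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # ~22.7M parameters is the figure quoted in the description.
    print(f"Estimated size: {estimated_size_mb(22_700_000):.1f} MB")

    # Placeholder vectors only; real embeddings would come from the model.
    khasi_vec = [0.2, 0.7, 0.1]
    english_vec = [0.3, 0.6, 0.4]
    print(f"Cosine similarity: {cosine_similarity(khasi_vec, english_vec):.2f}")
```

Under the float32 assumption, 22.7M parameters at 4 bytes each come to about 90.8MB, which is consistent with the quoted ~90MB.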

The dataset was sourced from cleaned Khasi texts, historical documents, bilingual translations, and cultural/administrative materials from Meghalaya, and was anonymized during preprocessing.

Key use cases include cross-lingual document similarity, cultural content discovery (e.g., Meghalaya’s Khasi folklore), and educational tools for the region’s tourism and heritage sectors. The lightweight design supports deployment on low-resource devices, enhancing accessibility in Meghalaya. Ethical considerations emphasize respect for Khasi heritage, encouraging collaboration with Meghalaya’s local communities.
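A minimal sketch of the cross-lingual document similarity use case might look like the following. It assumes the model can be loaded through the sentence-transformers library (the page lists a sentence-transformers tag, but this has not been verified here) and takes the repository ID from the citation URL. The helper names and the letter-count `toy_encode` stand-in are hypothetical; the stand-in exists only so the ranking logic can be run offline and is not cross-lingual.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Encoder = Callable[[List[str]], List[Vector]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: str, documents: List[str], encode: Encoder,
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank documents by cosine similarity to the query embedding."""
    vectors = encode([query] + list(documents))
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(doc, cosine(query_vec, vec)) for doc, vec in zip(documents, doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def load_model_encoder(model_id: str = "MWirelabs/khasi-english-semantic-search") -> Encoder:
    """Assumption: the model loads via sentence-transformers.

    Imported lazily so the rest of this sketch runs without the
    library or a network connection.
    """
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_id)
    return lambda texts: model.encode(texts).tolist()


def toy_encode(texts: List[str]) -> List[Vector]:
    """Hypothetical stand-in encoder: 26-dimensional letter counts."""
    vectors = []
    for text in texts:
        counts = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        vectors.append(counts)
    return vectors


if __name__ == "__main__":
    docs = ["Living root bridges of Meghalaya", "Annual rainfall statistics"]
    # Swap toy_encode for load_model_encoder() to use the actual model.
    for doc, score in search("root bridges", docs, toy_encode, top_k=2):
        print(f"{score:.2f}  {doc}")
```

With the real encoder, the query could be in English and the documents in Khasi (or the reverse), since the ranking step only depends on the embeddings.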

This pioneering effort by MWirelabs, released under Creative Commons CC0 1.0, positions the organization as a leader in Meghalaya and Northeast India’s AI innovation, building on the Khasi-English Word Embeddings model.

Citation:

@misc{kajingiathuhsearch2025,
  title={KaJingïathuhSearch2025: Khasi-English Semantic Search Model},
  author={MWirelabs},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/MWirelabs/khasi-english-semantic-search}}
}


Metadata

License: CC0 1.0 Public Domain
Hosted By: MWirelabs
Model Type: Multilingual Language Model
Model Format: PyTorch
Visibility: Open
Source Organisation: MWire Labs
Sector: Arts, Culture and Tourism
Updated Date & Time: 18/09/25 10:17:07
Created By: Badal Nyalang


Related Models

More Models from MWire Labs

Mizo OCR - Text Recognition for Mizo Language

OCR model for the Mizo language achieving 90.68% character accuracy on synthetic and curated printed text

Tags: Mizo, northeast-india, trocr, low-resource, OCR, Image-to-Text

NE-OCR

NE-OCR is a multilingual Optical Character Recognition model developed by MWire Labs to accurately recognize printed text from documents in Northeast Indian languages. The model supports Assamese, Bodo, English, Garo, Hindi, Khasi, Kokborok, Meitei (Bengali script), Meitei (Meitei Mayek script), Mizo, Nagamese, and Nyishi. It is designed to enable reliable digitization of books, newspapers, government records, educational materials, and cultural archives from Northeast India where mainstream OCR

Tags: Printed Text Recognition, Northeast India OCR, Multilingual OCR, vitstr, doctr, BODO, Optical Character Recognition, Mizo, khasi, northeast-india, Nagamese, Meitei, Nyishi, Kokborok, Garo, OCR

Nagamese Speech-to-Text

Automatic Speech Recognition (ASR) model for Nagamese speech, designed to transcribe spoken Nagamese into text for real-world usage.

Tags: whisper, Speech Recognition, ASR, Automatic Speech Recognition, low-resource-language, Nagamese

Garo OCR - Text Recognition for Garo

OCR model for the Garo language achieving 93.13% character accuracy.

Tags: Garo, florence-2, northeast-india, OCR, Image-to-Text

Northeast Language Identification

NE-LID is a fast and accurate language identification model for Northeast Indian languages using character level features. It is designed for low resource and script diverse text and achieves high accuracy on short sentences.

Tags: Multilingual, language identification, low-resource, northeast-india, MWire Labs, fastText

NortheastNER

NortheastNER is a token classification model built on XLM-RoBERTa and fine-tuned on ~25k sentences from gazetteers, news, and cultural texts across Northeast India. It detects region-specific entities, including places, tribes, festivals, tourist sites, flora, fauna, and experimental local names. It is suited to low-resource NER, regional search, cultural analytics, and knowledge graph applications.

Tags: Conservation, Token Classification, NER, XLM-RoBERTa, low-resource, northeast-india, Meghalaya, Northeast India

Kren-M

Northeast India's first AI language model. Kren-M is a 2.6B parameter bilingual model for Khasi-English, built on Gemma-2-2B. It features the Kren-NE custom tokenizer, which covers 7 NE languages (Khasi, Garo, Mizo, Assamese, Manipuri, Nagamese, Nyishi) with a 35.7% efficiency gain. It was trained on 5.43M Khasi sentences. Capabilities include bidirectional translation, natural conversation, and cultural context. It is designed for language preservation across Northeast India.

Tags: Northeast India Languages, Tokenizer, Foundational model, Kren-M, bilingual, continued-pretraining, Northeast India, khasi, northeast-india, Garo, low-resource, Instruction-Tuning, Indian Languages

NE-BERT

NE-BERT is Northeast India's first domain-specific multilingual foundation model. Built on the ModernBERT architecture and trained on 8.3 million sentences, it supports 9 regional languages: Assamese, Khasi, Garo, Manipuri (Meitei), Mizo, Nyishi, Nagamese, Kokborok, and Pnar. It achieves State-of-the-Art performance on regional benchmarks and offers 1.6x faster inference, bridging the digital divide for low-resource languages.

Tags: kokborok, Mizo, token-efficiency, northeast bert, low-resource-NLP, Masked Language Modeling, Assamese, Garo, Nyishi, Meitei, Nagamese, northeast-india, khasi, A'chik, modernbert, mwirelabs, Pnar

KhasiBERT

Khasi language model trained on 3.6M sentences using RoBERTa architecture. 110M parameters. Supports NLP tasks for Khasi text processing.

Tags: austroasiatic, roberta, foundational-model, digital-india, Bert, safetensors, endpoints_compatible, autotrain_compatible, Fill-Mask, Indian Language, low-resource, region:us, kha, khasi, Meghalaya, masked-lm

Khasi English Semantic Search Model

Tags: Meghalaya, safetensors, Sentence Similarity, cross-lingual, autotrain_compatible, semantic search, sentence-transformers, license:cc0-1.0, kha, khasi, text-embeddings-inference, khasi-culture
