Khasi-English semantic search model trained on 66,794 translation pairs, with cosine similarity scores of 0.69-0.74. About 90MB; supports Meghalaya tourism and culture applications. By MWirelabs.
Developed by MWirelabs, this model is the first production-ready semantic search system for Khasi-English language pairs. It reflects Northeast India’s linguistic diversity, with a special focus on Meghalaya. The model was trained on a curated corpus of 66,794 English-Khasi translation pairs (63,909 Khasi sentences, 65,239 English sentences, 65,241 parallel pairs) and uses the lightweight MiniLM-L6-v2 architecture (~22.7M parameters, ~90MB). It achieves cosine similarity scores of 0.69-0.74, indicating effective cross-lingual alignment for Khasi, a low-resource Austroasiatic language spoken primarily in Meghalaya.
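The ~90MB figure is consistent with the stated parameter count if the weights are stored as 32-bit floats. The storage precision is not given in the description, so float32 is an assumption in this quick check:

```python
# Rough size estimate from the parameter count.
# ASSUMPTION: weights are stored as float32 (4 bytes each); the model card does not say.
params = 22_700_000            # ~22.7M parameters, from the description
bytes_per_param = 4            # assumed float32
size_mb = params * bytes_per_param / 1_000_000  # decimal megabytes
print(f"{size_mb:.1f} MB")     # prints "90.8 MB"
```

The estimate covers weights only; tokenizer files and configuration would add a small amount on top.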
The dataset was sourced from cleaned Khasi texts, historical documents, bilingual translations, and cultural and administrative materials from Meghalaya, and was anonymized during preprocessing.
Key use cases include cross-lingual document similarity, cultural content discovery (e.g., Meghalaya’s Khasi folklore), and educational tools for the region’s tourism and heritage sectors. The lightweight design supports deployment on low-resource devices, which improves accessibility in Meghalaya. Users are asked to respect Khasi heritage and are encouraged to collaborate with Meghalaya’s local communities.
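Cross-lingual document similarity of this kind usually comes down to ranking document embeddings by cosine similarity to a query embedding. The sketch below illustrates that ranking step only. The vectors are toy placeholders, and the helper names `cosine_similarity` and `rank_documents` are hypothetical; in practice the embeddings would come from the model itself, whose loading API is not described here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# ASSUMPTION: toy 3-dimensional vectors standing in for real sentence embeddings
# (e.g. an English query and three Khasi documents encoded by the model).
query = [1.0, 0.0, 1.0]
docs = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [1.0, 0.0, 1.0],   # same direction as the query
    [1.0, 1.0, 0.0],   # partial overlap
]

ranking = rank_documents(query, docs)
for index, score in ranking:
    print(index, round(score, 2))
```

With real embeddings, scores for well-aligned Khasi-English sentence pairs would be expected in the reported 0.69-0.74 range rather than at the extremes shown by these toy vectors.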
This effort by MWirelabs, released under Creative Commons CC0 1.0, builds on the Khasi-English Word Embeddings model and is intended to advance AI innovation in Meghalaya and Northeast India.
Citation:
@misc{kajingiathuhsearch2025,
  title={KaJingïathuhSearch2025: Khasi-English Semantic Search Model},
  author={MWirelabs},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/MWirelabs/khasi-english-semantic-search}}
}
CC0 1.0 Public Domain
MWirelabs
Multilingual Language Model
PyTorch
Open
Arts, Culture and Tourism
18/09/25 10:17:07