NE-LID is a fast and accurate language identification model for Northeast Indian languages that uses character-level features. It is designed for low-resource, script-diverse text and achieves high accuracy on short sentences.
NE-LID is a sentence-level language identification model developed for low-resource languages of Northeast India. The model supports ten languages: Assamese, Bodo, Garo, Hindi, Khasi, Kokborok, Meitei, Mizo, Naga, and Nyishi. It is trained as a fastText supervised classifier with character n-gram features, which makes it robust to spelling variation, short text, and multiple writing scripts. Training used a balanced dataset of two thousand sentences per language with a stratified train/dev/test split. In evaluation, the model achieves around 99% accuracy and outperforms transformer-based language models on this task. The experiments indicate that character-level approaches are more effective than subword-based transformers for language identification in low-resource, script-diverse settings. The model is suitable for language routing, data filtering, preprocessing for machine translation and speech recognition, and other downstream language technology applications in the Northeast India context.
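The data preparation described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' actual pipeline: the helper names (`char_ngrams`, `stratified_split`, `to_fasttext_line`), the 80/10/10 split ratio, the n-gram range, and the language codes in the usage note are all assumptions, since the description does not state them.

```python
import random
from collections import defaultdict


def char_ngrams(word, minn=2, maxn=4):
    """Return character n-grams of a word wrapped in '<' and '>' boundary
    markers, in the style of fastText subword features.
    The default range (2 to 4) is an assumption, not the model's setting."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def stratified_split(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (label, text) pairs into train/dev/test so that every label
    keeps the same proportions in each part.
    The 80/10/10 ratio is an assumed value for illustration."""
    by_label = defaultdict(list)
    for label, text in samples:
        by_label[label].append(text)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    for label in sorted(by_label):
        texts = by_label[label]
        rng.shuffle(texts)
        n_train = int(len(texts) * ratios[0])
        n_dev = int(len(texts) * ratios[1])
        train += [(label, t) for t in texts[:n_train]]
        dev += [(label, t) for t in texts[n_train:n_train + n_dev]]
        # Whatever remains after train and dev goes to test.
        test += [(label, t) for t in texts[n_train + n_dev:]]
    return train, dev, test


def to_fasttext_line(label, text):
    """Format one example in the '__label__<name> <text>' layout that
    fastText supervised training reads, one example per line."""
    return f"__label__{label} {text.strip()}"
```

For example, `to_fasttext_line("kha", sentence)` would produce one training line for a Khasi sentence (the code `kha` is a hypothetical label). Lines written to a file in this layout could then be passed to `fasttext.train_supervised`, whose `minn` and `maxn` parameters control the character n-gram lengths; the values used for NE-LID are not given in this description.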
Attribution 4.0 International (CC BY 4.0)
MWirelabs
Classification Model
Other
Open
Social
14/01/26 15:04:29
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by National e-Governance Division for AIKosh mission.