Detect the language of provided text. Currently supports 22 languages.
IndicLID is a language identifier for all 22 Indian languages listed in the Indian constitution, covering both native-script and romanized text. It is the first LID for romanized text in Indian languages. It is a two-stage classifier: an ensemble of a fast linear classifier and a slower classifier fine-tuned from a pre-trained LM. It can predict 47 classes (24 native-script classes and 21 roman-script classes, plus English and Others). IndicLID is evaluated on the Bhasha-Abhijnaanam benchmark, which is released along with this work. For native-script text, IndicLID has better language coverage than existing LIDs and is competitive with or better than them. The IndicLID model is 10 times faster and 4 times smaller than the NLLB model, and it establishes a strong baseline on roman-script text.
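The two-stage design described above can be illustrated with a minimal sketch. It is not IndicLID's actual code or API. It assumes, as one plausible reading of "two-stage", that the fast classifier answers first and the slower classifier is consulted only when the fast one's confidence falls below a threshold. The function names, the label strings (`hin_Deva`, `hin_Latn`, `other`), the threshold value, and both toy classifiers are hypothetical stand-ins.

```python
from typing import Callable, Tuple

# A prediction is (label, confidence in [0, 1]).
Prediction = Tuple[str, float]
Classifier = Callable[[str], Prediction]


def two_stage_predict(
    text: str,
    fast: Classifier,
    slow: Classifier,
    threshold: float = 0.6,  # hypothetical cut-off, not taken from IndicLID
) -> Tuple[str, float, str]:
    """Return (label, confidence, stage_used).

    Assumption: the fast model's answer is accepted when it is
    confident enough; otherwise the slower model decides.
    """
    label, conf = fast(text)
    if conf >= threshold:
        return label, conf, "fast"
    label, conf = slow(text)
    return label, conf, "slow"


def toy_fast(text: str) -> Prediction:
    """Toy stand-in for a linear classifier.

    It is confident only when it sees Devanagari characters. A real
    model must also separate the several languages that share a script.
    """
    if any("\u0900" <= ch <= "\u097F" for ch in text):
        return ("hin_Deva", 0.95)
    return ("other", 0.30)


def toy_slow(text: str) -> Prediction:
    """Toy stand-in for a classifier fine-tuned from a pre-trained LM."""
    if "namaste" in text.lower():
        return ("hin_Latn", 0.80)
    return ("other", 0.50)


if __name__ == "__main__":
    for sample in ["नमस्ते", "namaste dost"]:
        print(sample, two_stage_predict(sample, toy_fast, toy_slow))
```

In this sketch, native-script input is resolved by the fast stage alone, while the romanized input falls through to the slower stage. That pattern is one way such an ensemble can stay fast on easy inputs.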
MIT
AI4Bharat
Text Language Detection
N.A.
Open
Sector Agnostic
21/02/25 13:21:38