
Bhashini-AI4Bharat Textual Language Detection v1.0

Detects the language of the provided text. Currently supports 23 languages (English, Bangla, Manipuri, Bodo, Konkani, Oriya, Nepali, Marathi, Sindhi, Sanskrit, Malayalam, Urdu, Assamese, Telugu, Dogri, Gujarati, Kashmiri, Punjabi, Santali, Maithili, Hindi, Tamil, Kannada).

About Model

IndicLID is a language identifier for all 22 Indian languages listed in the Indian constitution, covering both native-script and romanized text. It is the first LID for romanized text in Indian languages. It is a two-stage classifier: an ensemble of a fast linear classifier and a slower classifier fine-tuned from a pre-trained LM. It can predict 47 classes (24 native-script classes and 21 roman-script classes, plus English and Others). IndicLID is evaluated on the Bhasha-Abhijnaanam benchmark, which is released along with this work. For native-script text, IndicLID has better language coverage than existing LIDs and is competitive with or better than them. The IndicLID model is 10 times faster and 4 times smaller than the NLLB model, and it also establishes strong baseline results on roman-script text.
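The two-stage idea described above can be illustrated with a minimal Python sketch. This is not the IndicLID implementation: the confidence-threshold hand-off rule, the `threshold` value of 0.6, the label names, and the toy classifiers `toy_fast` and `toy_slow` are all assumptions made for illustration. The sketch only shows the general pattern of trying a cheap classifier first and deferring to a slower one when the cheap one is unsure.

```python
from typing import Callable, Tuple

# A prediction is (label, confidence in [0, 1]).
Prediction = Tuple[str, float]
Classifier = Callable[[str], Prediction]


def two_stage_predict(
    text: str,
    fast: Classifier,
    slow: Classifier,
    threshold: float = 0.6,  # hypothetical cut-off, not taken from the source
) -> Tuple[str, float, str]:
    """Return (label, confidence, stage_used) for a two-stage cascade."""
    label, conf = fast(text)
    if conf >= threshold:
        # The cheap classifier is confident enough; skip the slow one.
        return label, conf, "fast"
    # Otherwise defer to the slower, presumably more accurate classifier.
    label, conf = slow(text)
    return label, conf, "slow"


def toy_fast(text: str) -> Prediction:
    """Hypothetical stand-in for a fast linear classifier."""
    # Confident only when the text contains Devanagari characters.
    if any("\u0900" <= ch <= "\u097F" for ch in text):
        return "hin_Deva", 0.95
    return "other", 0.30


def toy_slow(text: str) -> Prediction:
    """Hypothetical stand-in for a classifier fine-tuned from a pre-trained LM."""
    # A keyword rule stands in for a real model here.
    if "namaste" in text.lower():
        return "hin_Latn", 0.80
    return "eng_Latn", 0.70


if __name__ == "__main__":
    for sample in ["नमस्ते", "namaste dost", "hello"]:
        print(sample, two_stage_predict(sample, toy_fast, toy_slow))
```

In a cascade like this, most inputs would be handled by the fast stage, so average latency stays low while the slow stage is reserved for ambiguous cases such as romanized text.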


Metadata

MIT

AI4Bharat

OCR (Optical Character Recognition) Model

Other

Open

Sector Agnostic

05/03/25 15:21:43

Admin

3 MB

Activity Overview

  • Downloads3
  • Downloads 218
  • Views 3,920
  • File Size 3 MB

Tags

  • Multilingual
  • AI4Bharat
  • NLP
  • Bhashini
  • Text Processing
  • Deep Learning
  • Transformer
  • Text Language Detection

License Control

MIT

Version Control

Version 2 (3 MB)
  • admin · 1 year(s) ago
    • Folder: Benchmark
      • compile_final_pilot_1.py
      • create_benchmark_extra.py
      • create_benchmark.py
    • Folder: deployement
    • Folder: filter_Dakshina
    • Folder: final_runs_ACL_inference
    • Folder: final_runs_train
    • Folder: Inference
    • Folder: nueral_net
    • Folder: preprocess_indiccorp
    • README.md (text/markdown)

More Models from TechCorp

SANTHAM-Gemma3-4B-SH-Seg-Poetry-Finetuned
SANTHAM-Gemma3-4B-SH-Seg-Poetry-Finetuned is a Sanskrit-to-Tamil translation model specialized for segmented text produced by the Sanskrit Heritage segmenter.
translation
poetry
santham
Segmened
language:tam
language:san
  • Upvoters 0
  • Downloads 0
  • File Size 115.62 MB
  • Views 55
Updated 25 day(s) ago

DIGITAL INDIA BHASHINI DIVISION

SANTHAM-Gemma3-4B-Finetuned
SANTHAM-Gemma3-4B-Finetuned is a Sanskrit → Tamil translation model built on the Gemma 3 (4B) architecture. It is trained on a parallel corpus developed as part of the Sanskrit Knowledge Accessor project, enabling it to capture linguistic nuances and generate fluent Tamil translations from classical Sanskrit inputs.
translation
language:san
language:tam
santham
  • Upvoters 0
  • Downloads 2
  • File Size 2.08 GB
  • Views 80
Updated 25 day(s) ago

DIGITAL INDIA BHASHINI DIVISION

SANTHAM-Gemma3-4B-Anvaya-Poetry-Finetuned
SANTHAM-Gemma3-4B-Anvaya-Poetry-Finetuned is a Sanskrit-to-Tamil translation model specialized for Anvaya translation of poetry.
poetry
santham
anvaya
language:tam
language:san
translation
  • Upvoters 0
  • Downloads 2
  • File Size 2.09 GB
  • Views 48
Updated 25 day(s) ago

DIGITAL INDIA BHASHINI DIVISION

SPRING-INX-DATA2VEC-AQC-URDU
Automatic Speech Recognition (ASR) model for speech recognition, processing audio and transcribing spoken content into text.
spring_lab
Data2vec_aqc
low-resource-language
SSL_finetunning
ssl
urdu
IITM
  • Upvoters 0
  • Downloads 0
  • File Size 3.52 GB
  • Views 71
Updated 1 month(s) ago

DIGITAL INDIA BHASHINI DIVISION

SPRING-INX-DATA2VEC-AQC-TELUGU
Automatic Speech Recognition (ASR) model for speech recognition, processing audio and transcribing spoken content into text.
spring_lab
low-resource-language
SSL_finetunning
Data2vec_aqc
IITM
telugu
ssl
  • Upvoters 0
  • Downloads 0
  • File Size 3.52 GB
  • Views 59
Updated 1 month(s) ago

DIGITAL INDIA BHASHINI DIVISION

SPRING-INX-DATA2VEC-AQC-TAMIL
Automatic Speech Recognition (ASR) model for speech recognition, processing audio and transcribing spoken content into text.
low-resource-language
SSL_finetunning
Data2vec_aqc
spring_lab
IITM
tamil
ssl
  • Upvoters 0
  • Downloads 1
  • File Size 3.52 GB
  • Views 58
Updated 1 month(s) ago

DIGITAL INDIA BHASHINI DIVISION

SPRING-INX-DATA2VEC-AQC-BENGALI
Automatic Speech Recognition (ASR) model for speech recognition, processing audio and transcribing spoken content into text.
Data2vec_aqc
IITM
spring_lab
ssl
low-resource-languages
SSL_finetunning
bengali
  • Upvoters 0
  • Downloads 0
  • File Size 3.52 GB
  • Views 98
Updated 2 month(s) ago

DIGITAL INDIA BHASHINI DIVISION

SPRING-INX-DATA2VEC-AQC-BODO
Automatic Speech Recognition (ASR) model for speech recognition, processing audio and transcribing spoken content into text.
IITM
spring_lab
SSL_finetunning
low-resource-language
BODO
Data2vec_aqc
ssl
  • Upvoters 0
  • Downloads 0
  • File Size 3.52 GB
  • Views 124
Updated 2 month(s) ago

DIGITAL INDIA BHASHINI DIVISION

SPRING-INX-DATA2VEC-AQC-BHOJPURI
Automatic Speech Recognition (ASR) model for speech recognition, processing audio and transcribing spoken content into text.
SSL_finetunning
Data2vec_aqc
spring_lab
IITM
ssl
Bhojpuri
low-resource-language
  • Upvoters 0
  • Downloads 4
  • File Size 3.52 GB
  • Views 115
Updated 2 month(s) ago

DIGITAL INDIA BHASHINI DIVISION

SPRING-INX-DATA2VEC-AQC-MALAYALAM
Automatic Speech Recognition (ASR) model for speech recognition, processing audio and transcribing spoken content into text.
SSL_finetunning
ssl
malayalam
IITM
spring_lab
Data2vec_aqc
low-resource-language
  • Upvoters 0
  • Downloads 1
  • File Size 3.52 GB
  • Views 124
Updated 2 month(s) ago

DIGITAL INDIA BHASHINI DIVISION