NE-OCR

NE-OCR is a multilingual Optical Character Recognition model developed by MWire Labs to accurately recognize printed text from documents in Northeast Indian languages. The model supports Assamese, Bodo, English, Garo, Hindi, Khasi, Kokborok, Meitei (Bengali script), Meitei (Meitei Mayek script), Mizo, Nagamese, and Nyishi. It is designed to enable reliable digitization of books, newspapers, government records, educational materials, and cultural archives from Northeast India where mainstream OCR

About Model

NE-OCR is a multilingual OCR system built specifically for the linguistic diversity of Northeast India. Many global OCR engines struggle with regional scripts, mixed-language documents, and low-resource languages commonly used across the region. NE-OCR addresses this gap by providing accurate printed text recognition across multiple languages and scripts used in Northeast India.

The model supports the following languages:

  • Assamese
  • Bodo
  • English
  • Garo
  • Hindi
  • Khasi
  • Kokborok
  • Meitei (Bengali script)
  • Meitei (Meitei Mayek script)
  • Mizo
  • Nagamese
  • Nyishi

These languages represent several major writing systems used in the region including Latin, Bengali, Devanagari, and Meitei Mayek. Documents across Northeast India frequently contain mixed-language text due to administrative, educational, and cultural practices. NE-OCR is designed to handle such multilingual documents effectively.

The model is trained using a diverse dataset of printed text collected from books, newspapers, scanned documents, educational publications, and institutional records from across the region. The system uses modern transformer-based OCR architecture combining visual feature extraction with sequence recognition models to generate accurate text outputs.

NE-OCR was developed by MWire Labs as part of a broader effort to build AI infrastructure for indigenous and low-resource languages of Northeast India. The goal is to enable governments, researchers, startups, and institutions to digitize documents, preserve cultural knowledge, and build language technologies on top of reliable OCR systems.

Benchmark Performance

NE-OCR was evaluated against widely used OCR systems including EasyOCR, Tesseract 5, TrOCR-large, and Chandra across multiple languages from Northeast India. The results show strong performance across different scripts used in the region.

Selected benchmark examples:

Language               Script        NE-OCR  EasyOCR  Tesseract 5  TrOCR-large  Chandra
Assamese               Bengali       97.46%  32.25%   8.79%        0.80%        57.83%
Khasi                  Latin         98.85%  77.78%   80.72%       93.22%       94.15%
Meitei (Meitei Mayek)  Meitei Mayek  95.56%  2.50%    2.24%        2.45%        2.57%
Mizo                   Latin         95.96%  67.62%   68.44%       84.58%       92.96%
Hindi                  Devanagari    97.69%  49.54%   41.48%       1.27%        85.78%

Across the full benchmark covering twelve languages, NE-OCR achieved an average recognition accuracy of 94.99 percent, significantly outperforming commonly used OCR systems on regional languages.

Typical use cases include digitization of historical archives, processing of government records, OCR for newspapers and educational materials, creation of datasets for language technology research, and document search or knowledge extraction systems.

NE-OCR is designed to serve as a foundational component for document AI systems in Northeast India. By enabling reliable OCR across regional languages and scripts, the model helps unlock large volumes of printed knowledge that remain difficult to digitize with existing OCR tools. The project is developed and maintained by MWire Labs as part of its mission to advance artificial intelligence technologies for the languages of Northeast India.
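
The selected benchmark figures given earlier in this section can be compared programmatically. The following is a minimal Python sketch, not part of the NE-OCR release: it copies the five listed rows into a dictionary and reports, for each language, the strongest of the other four systems and the gap in percentage points between it and NE-OCR. The names SCORES, best_baseline, and margins are hypothetical helpers introduced here for illustration only.

```python
# Selected benchmark accuracies in percent, copied from the figures above.
# Only the five listed languages are included, not the full twelve.
SCORES = {
    "Assamese": {"NE-OCR": 97.46, "EasyOCR": 32.25, "Tesseract 5": 8.79,
                 "TrOCR-large": 0.80, "Chandra": 57.83},
    "Khasi": {"NE-OCR": 98.85, "EasyOCR": 77.78, "Tesseract 5": 80.72,
              "TrOCR-large": 93.22, "Chandra": 94.15},
    "Meitei (Meitei Mayek)": {"NE-OCR": 95.56, "EasyOCR": 2.50, "Tesseract 5": 2.24,
                              "TrOCR-large": 2.45, "Chandra": 2.57},
    "Mizo": {"NE-OCR": 95.96, "EasyOCR": 67.62, "Tesseract 5": 68.44,
             "TrOCR-large": 84.58, "Chandra": 92.96},
    "Hindi": {"NE-OCR": 97.69, "EasyOCR": 49.54, "Tesseract 5": 41.48,
              "TrOCR-large": 1.27, "Chandra": 85.78},
}


def best_baseline(row):
    """Return (system, accuracy) for the strongest system other than NE-OCR."""
    baselines = {name: acc for name, acc in row.items() if name != "NE-OCR"}
    name = max(baselines, key=baselines.get)
    return name, baselines[name]


def margins(scores):
    """Map each language to (best baseline, NE-OCR lead in percentage points)."""
    result = {}
    for language, row in scores.items():
        name, acc = best_baseline(row)
        result[language] = (name, round(row["NE-OCR"] - acc, 2))
    return result


MARGINS = margins(SCORES)

if __name__ == "__main__":
    for language, (name, gap) in MARGINS.items():
        print(f"{language}: NE-OCR leads {name} by {gap:.2f} points")
```

Because only five of the twelve benchmark languages are listed, any average computed from this dictionary will not reproduce the 94.99 percent figure reported for the full benchmark.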

NE-OCR

Metadata

Attribution 4.0 International (CC BY 4.0)

MWirelabs

Transformers

PyTorch

Open

MWire Labs

Sector Agnostic

07/03/26 12:14:44

Badal Nyalang

0

Activity Overview

  • Downloads 0
  • Redirect 1
  • Views 27
  • File Size 0

Tags

  • OCR
  • northeast-india
  • doctr
  • vitstr
  • Mizo
  • Garo
  • khasi
  • Nyishi
  • Kokborok
  • Nagamese
  • BODO
  • Meitei
  • Optical Character Recognition
  • Multilingual OCR
  • Northeast India OCR
  • Printed Text Recognition

License Control

Attribution 4.0 International (CC BY 4.0)

Related Models

Northeast Language Identification
NE-LID is a fast and accurate language identification model for Northeast Indian languages using character level features. It is designed for low resource and script diverse text and achieves high accuracy on short sentences.
language identification
fasttext
northeast-india
low-resource
Multilingual
MWire Labs
fastText
  • Upvoters 0
  • Downloads 9
  • File Size 0
  • Views 324
Updated 2 month(s) ago

MWIRE LABS

More Models from MWire Labs

Mizo OCR - Text Recognition for Mizo Language
OCR model for the Mizo language achieving 90.68% character accuracy on synthetic and curated printed text
OCR
low-resource
Image-to-Text
trocr
northeast-india
Mizo
  • Upvoters 0
  • Downloads 1
  • File Size 0
  • Views 18
Updated 2 day(s) ago

MWIRE LABS

Nagamese Speech-to-Text
Automatic Speech Recognition (ASR) model for Nagamese speech, designed to transcribe spoken Nagamese into text for real-world usage.
ASR
Speech Recognition
low-resource-language
Nagamese
whisper
Automatic Speech Recognition
  • Upvoters 0
  • Downloads 0
  • File Size 0
  • Views 10
Updated 2 day(s) ago

MWIRE LABS

Garo OCR - Text Recognition for Garo
OCR model for the Garo language achieving 93.13% character accuracy.
florence-2
Garo
northeast-india
Image-to-Text
OCR
  • Upvoters 0
  • Downloads 0
  • File Size 0
  • Views 17
Updated 2 day(s) ago

MWIRE LABS

Northeast Language Identification
NE-LID is a fast and accurate language identification model for Northeast Indian languages using character level features. It is designed for low resource and script diverse text and achieves high accuracy on short sentences.
fasttext
fastText
language identification
MWire Labs
Multilingual
low-resource
northeast-india
  • Upvoters 0
  • Downloads 9
  • File Size 0
  • Views 324
Updated 2 month(s) ago

MWIRE LABS

NortheastNER
NortheastNER is a token classification model built on XLM-RoBERTa and fine-tuned on ~25k sentences from gazetteers, news, and cultural texts across Northeast India. It detects region-specific entities, places, tribes, festivals, tourist sites, flora, fauna, and experimental local names; ideal for low-resource NER, regional search, cultural analytics, and knowledge graph applications.
Northeast India
Token Classification
NER
northeast-india
low-resource
XLM-RoBERTa
Meghalaya
Conservation
  • Upvoters 0
  • Downloads 9
  • File Size 0
  • Views 174
Updated 3 month(s) ago

MWIRE LABS

Kren-M
Northeast India's first AI language model. Kren-M is a 2.6B parameter bilingual model for Khasi-English, built on Gemma-2-2B. It features the Kren-NE custom tokenizer, which covers 7 NE languages (Khasi, Garo, Mizo, Assamese, Manipuri, Nagamese, Nyishi) with a 35.7% efficiency gain. Trained on 5.43M Khasi sentences. Capabilities: bidirectional translation, natural conversation, cultural context. Designed for language preservation across Northeast India.
bilingual
Instruction-Tuning
continued-pretraining
low-resource
northeast-india
khasi
Tokenizer
Foundational model
Northeast India Languages
Kren-M
Northeast India
Indian Languages
Garo
  • Upvoters 0
  • Downloads 19
  • File Size 0
  • Views 515
Updated 3 month(s) ago

MWIRE LABS

NE-BERT
NE-BERT is Northeast India's first domain-specific multilingual foundation model. Built on the ModernBERT architecture and trained on 8.3 million sentences, it supports 9 regional languages: Assamese, Khasi, Garo, Manipuri (Meitei), Mizo, Nyishi, Nagamese, Kokborok, and Pnar. It achieves State-of-the-Art performance on regional benchmarks and offers 1.6x faster inference, bridging the digital divide for low-resource languages.
Pnar
modernbert
Masked Language Modeling
northeast-india
low-resource-NLP
northeast bert
mwirelabs
token-efficiency
Assamese
Garo
Nyishi
Meitei
Nagamese
khasi
A'chik
Mizo
kokborok
  • Upvoters 0
  • Downloads 16
  • File Size 0
  • Views 396
Updated 3 month(s) ago

MWIRE LABS

KhasiBERT
Khasi language model trained on 3.6M sentences using RoBERTa architecture. 110M parameters. Supports NLP tasks for Khasi text processing.
Meghalaya
roberta
Fill-Mask
khasi
Bert
masked-lm
foundational-model
low-resource
Indian Language
austroasiatic
kha
autotrain_compatible
endpoints_compatible
region:us
digital-india
safetensors
  • Upvoters 1
  • Downloads 19
  • File Size 0
  • Views 626
Updated 6 month(s) ago

MWIRE LABS

Khasi English Semantic Search Model
Khasi-English semantic search model, trained on 66,794 pairs with 0.69-0.74 similarity. ~90MB, supports Meghalaya tourism/culture. By MWirelabs
khasi-culture
text-embeddings-inference
autotrain_compatible
license:cc0-1.0
kha
en
Sentence Similarity
cross-lingual
semantic search
khasi
safetensors
sentence-transformers
Meghalaya
  • Upvoters 0
  • Downloads 21
  • File Size 0
  • Views 549
Updated 6 month(s) ago

MWIRE LABS