NE-OCR is a multilingual Optical Character Recognition model developed by MWire Labs to accurately recognize printed text from documents in Northeast Indian languages. The model supports Assamese, Bodo, English, Garo, Hindi, Khasi, Kokborok, Meitei (Bengali script), Meitei (Meitei Mayek script), Mizo, Nagamese, and Nyishi. It is designed to enable reliable digitization of books, newspapers, government records, educational materials, and cultural archives from Northeast India where mainstream OCR
NE-OCR is a multilingual OCR system built specifically for the linguistic diversity of Northeast India. Many global OCR engines struggle with the regional scripts, mixed-language documents, and low-resource languages common across the region. NE-OCR addresses this gap by providing accurate printed text recognition across the languages and scripts used in Northeast India.

The model supports the following languages: Assamese, Bodo, English, Garo, Hindi, Khasi, Kokborok, Meitei (Bengali script), Meitei (Meitei Mayek script), Mizo, Nagamese, and Nyishi. These languages represent several major writing systems used in the region, including Latin, Bengali, Devanagari, and Meitei Mayek.

Documents across Northeast India frequently contain mixed-language text due to administrative, educational, and cultural practices. NE-OCR is designed to handle such multilingual documents.

The model is trained on a diverse dataset of printed text collected from books, newspapers, scanned documents, educational publications, and institutional records from across the region. The system uses a modern transformer-based OCR architecture that combines visual feature extraction with sequence recognition models to generate text output.

NE-OCR was developed by MWire Labs as part of a broader effort to build AI infrastructure for indigenous and low-resource languages of Northeast India. The goal is to enable governments, researchers, startups, and institutions to digitize documents, preserve cultural knowledge, and build language technologies on top of reliable OCR systems.

Benchmark Performance

NE-OCR was evaluated against widely used OCR systems, including EasyOCR, Tesseract 5, TrOCR-large, and Chandra, across multiple languages from Northeast India. The results show strong performance across the different scripts used in the region.
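The description does not say which metric the benchmark accuracy figures use. As a point of reference only, a common way to score OCR output is character-level accuracy, defined as 100 × (1 − CER), where CER is the edit distance between the reference and the recognized text divided by the reference length. The sketch below assumes that definition; the function names `edit_distance` and `character_accuracy` are hypothetical and are not part of NE-OCR or of any evaluated system.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # previous[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 100 * (1 - CER) as a percentage, clamped at 0.

    Assumption: this is one common accuracy definition, not necessarily
    the one used for the benchmark figures in this description.
    """
    if not reference:
        return 100.0 if not hypothesis else 0.0
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 100.0 * (1.0 - cer))


# One wrong character out of seven in a short Latin-script string.
print(character_accuracy("khublei", "khublai"))
```

For Bengali, Devanagari, and Meitei Mayek text, one visible character can span several code points, so an evaluation along these lines would likely normalize both strings (for example with `unicodedata.normalize("NFC", text)`) before comparing them.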
Selected benchmark examples:

| Language | Script | NE-OCR | EasyOCR | Tesseract 5 | TrOCR-large | Chandra |
| --- | --- | --- | --- | --- | --- | --- |
| Assamese | Bengali | 97.46% | 32.25% | 8.79% | 0.80% | 57.83% |
| Khasi | Latin | 98.85% | 77.78% | 80.72% | 93.22% | 94.15% |
| Meitei (Meitei Mayek) | Meitei Mayek | 95.56% | 2.50% | 2.24% | 2.45% | 2.57% |
| Mizo | Latin | 95.96% | 67.62% | 68.44% | 84.58% | 92.96% |
| Hindi | Devanagari | 97.69% | 49.54% | 41.48% | 1.27% | 85.78% |

Across the full benchmark covering twelve languages, NE-OCR achieved an average recognition accuracy of 94.99%, significantly outperforming commonly used OCR systems on regional languages.

Typical use cases include digitization of historical archives, processing of government records, OCR for newspapers and educational materials, creation of datasets for language technology research, and document search or knowledge extraction systems.

NE-OCR is designed to serve as a foundational component for document AI systems in Northeast India. By enabling reliable OCR across regional languages and scripts, the model helps unlock large volumes of printed knowledge that remain difficult to digitize with existing OCR tools.

The project is developed and maintained by MWire Labs as part of its mission to advance artificial intelligence technologies for the languages of Northeast India.
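To make the size of the gap in the selected benchmark examples concrete, the following sketch computes, for each listed language, the strongest competing system and NE-OCR's margin over it in percentage points. The numbers are copied from the selected examples in this description; the names `BENCHMARK`, `best_baseline`, and `margins` are hypothetical helpers introduced only for illustration.

```python
# Accuracy figures (percent) copied from the selected benchmark examples.
BENCHMARK = {
    "Assamese": {"NE-OCR": 97.46, "EasyOCR": 32.25, "Tesseract 5": 8.79,
                 "TrOCR-large": 0.80, "Chandra": 57.83},
    "Khasi": {"NE-OCR": 98.85, "EasyOCR": 77.78, "Tesseract 5": 80.72,
              "TrOCR-large": 93.22, "Chandra": 94.15},
    "Meitei (Meitei Mayek)": {"NE-OCR": 95.56, "EasyOCR": 2.50, "Tesseract 5": 2.24,
                              "TrOCR-large": 2.45, "Chandra": 2.57},
    "Mizo": {"NE-OCR": 95.96, "EasyOCR": 67.62, "Tesseract 5": 68.44,
             "TrOCR-large": 84.58, "Chandra": 92.96},
    "Hindi": {"NE-OCR": 97.69, "EasyOCR": 49.54, "Tesseract 5": 41.48,
              "TrOCR-large": 1.27, "Chandra": 85.78},
}


def best_baseline(scores: dict[str, float], model: str = "NE-OCR") -> tuple[str, float]:
    """Return the name and score of the strongest system other than `model`."""
    baselines = {name: score for name, score in scores.items() if name != model}
    name = max(baselines, key=baselines.get)
    return name, baselines[name]


def margins(benchmark: dict[str, dict[str, float]], model: str = "NE-OCR") -> dict[str, tuple[str, float]]:
    """Map each language to (best baseline name, margin in percentage points)."""
    result = {}
    for language, scores in benchmark.items():
        name, score = best_baseline(scores, model)
        result[language] = (name, round(scores[model] - score, 2))
    return result


for language, (name, margin) in margins(BENCHMARK).items():
    print(f"{language}: +{margin:.2f} points over {name}")
```

In these five rows the margin is smallest on the Latin-script languages (Khasi and Mizo) and largest on Meitei Mayek, where every compared system scores below 3%. The mean NE-OCR accuracy over the five rows is about 97.1%, which differs from the reported 94.99% average because that figure covers all twelve languages and only five are listed here.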
Attribution 4.0 International (CC BY 4.0)
MWirelabs
Transformers
PyTorch
Open
Sector Agnostic
07/03/26 12:14:44