OCR model for the Mizo language achieving 90.68% character accuracy on synthetic and curated printed text
MizoOCR is an Optical Character Recognition (OCR) model for Mizo, a Tibeto-Burman language spoken by over 800,000 people in Mizoram, Northeast India. It is built on TrOCR (microsoft/trocr-base-printed) and fine-tuned on a deduplicated dataset of 70,000 image-text pairs that combines synthetic renders with curated samples. The model reaches 89.61% validation and 90.68% test character accuracy. MizoOCR correctly handles Mizo's diacritical characters (â, ê, î, ô, û), which cause failures in existing generic OCR systems. It was developed by MWire Labs as part of the Northeast India OCR initiative, which aims to bring document digitization capabilities to underrepresented indigenous languages of the region.
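The listing does not say how character accuracy was computed. As a minimal sketch, assuming it is defined as 1 minus the character error rate (total Levenshtein edit distance divided by total reference length), the metric could be reproduced as follows. The function names `edit_distance` and `char_accuracy` are hypothetical, and the NFC normalization step is an assumption added so that precomposed and combining forms of â, ê, î, ô, û compare as equal.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(refs: list[str], hyps: list[str]) -> float:
    """Assumed definition: 1 - (total edits / total reference characters)."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(refs, hyps):
        # Assumption: normalize to NFC so "a" + U+0302 equals precomposed "â".
        ref = unicodedata.normalize("NFC", ref)
        hyp = unicodedata.normalize("NFC", hyp)
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    return 1.0 - total_edits / total_chars


# Illustrative strings only, not taken from the MizoOCR dataset.
dropped_diacritic = char_accuracy(["thlâ"], ["thla"])
combining_form = char_accuracy(["thlâ"], ["thla\u0302"])
```

In this sketch, an OCR output that drops the circumflex counts as one substitution, while an output that encodes the same character in combining form counts as correct. Under a different definition (for example, exact per-position matching), the figures would differ.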
Attribution 4.0 International (CC BY 4.0)
MWirelabs
OCR (Optical Character Recognition) Model
Transformers
Open
Sector Agnostic
25/02/26 12:33:50
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by National e-Governance Division for AIKosh mission.