
A large-scale handwritten word dataset for 10 Indian languages, supporting OCR and multilingual handwriting recognition research
The IIIT-INDIC-HW-WORDS dataset is a large-scale collection of handwritten word images spanning 10 major Indian languages. It has been curated to support research in handwriting recognition, OCR (Optical Character Recognition), and multilingual document analysis.
Languages Covered: Bengali, Hindi, Gujarati, Kannada, Malayalam, Odia, Punjabi (Gurmukhi), Tamil, Telugu, Urdu.
Content: Word-level handwritten samples contributed by diverse native speakers, ensuring variations in handwriting style, stroke patterns, and writing speed.
Format: Each word is stored as an image with its corresponding Unicode ground truth.
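As an illustration of working with image and ground-truth pairs, the sketch below parses annotation lines into (image path, Unicode label) tuples. The one-pair-per-line layout with a whitespace separator, the sample paths, and the function name `parse_annotations` are assumptions made for this example; the actual annotation layout of the distributed files is not specified here and may differ.

```python
from typing import Iterable, List, Tuple


def parse_annotations(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'image_path<whitespace>label' lines into (path, label) pairs.

    The line layout is an assumption for this sketch, not a documented
    property of the dataset.
    """
    pairs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split only on the first run of whitespace so the label stays intact.
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip lines that carry no label
        pairs.append((parts[0], parts[1]))
    return pairs


# Hypothetical sample lines; real paths and labels will differ.
sample = ["hindi/0001.jpg नमस्ते", "", "tamil/0002.jpg தமிழ்"]
pairs = parse_annotations(sample)
for path, label in pairs:
    print(path, label)
```

When reading from disk, open the annotation file with `encoding="utf-8"` so that labels are decoded as Unicode text regardless of the platform default.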
Scale: The dataset includes hundreds of thousands of word images, making it one of the most comprehensive handwritten corpora for Indian scripts.
Purpose:
Training and benchmarking handwriting recognition models.
Developing multilingual OCR systems.
Supporting cross-lingual and script-independent handwriting research.
Applications: Digital archiving, document digitization, educational tools, accessibility technologies, and AI-driven handwriting analysis.
This dataset addresses the diversity and complexity of Indian scripts (abugida structure, conjunct consonants, diacritics, and multiple zones) and serves as a valuable resource for the research community.
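To make the script complexity concrete, the sketch below breaks a single Devanagari cluster into its Unicode code points using Python's standard `unicodedata` module. The example string and the helper name `describe` are chosen for illustration and are not taken from the dataset. The cluster reads as one visual unit but is stored as four code points: a consonant, a virama that forms the conjunct, a second consonant, and a vowel sign that is rendered to the left of the consonants even though it is stored after them.

```python
import unicodedata


def describe(word: str):
    """Return (code point, general category, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in word
    ]


# KA + VIRAMA + SSA + VOWEL SIGN I, written with escapes to keep the
# code point sequence unambiguous.
word = "\u0915\u094d\u0937\u093f"
rows = describe(word)
for code_point, category, name in rows:
    print(code_point, category, name)
```

Because `len(word)` counts code points rather than visual clusters, the choice of recognition unit (code point, cluster, or whole word) affects both the label vocabulary and how error rates are computed.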
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by National e-Governance Division for AIKosh mission.