Government Of India
IndicDLP

About Dataset

IndicDLP is the largest multilingual, multi-layout, multi-domain Document Layout Parsing (DLP) dataset for Indian languages, designed to advance research in document transcription and understanding. It contains 1,22,000 (122,000) human-annotated document images in 11 Indian languages, drawn from ten of the most common domains, including newspapers, magazines, and textbooks. IndicDLP standardizes complex layouts into structured components, which supports document transcription pipelines and addresses the particular challenges of Indic scripts. The documents come from publicly available online repositories and include born-digital, scanned, and photographed PDFs, in formats ranging from the pre-independence era to the present day.
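
To make "structured components" concrete, the following is a minimal Python sketch of how one page's layout regions might be represented and ordered before transcription. This page does not describe the dataset's actual annotation format, so the `Region` class, its fields, the label names, and the sample coordinates are all hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One labelled layout component on a page (hypothetical schema)."""
    label: str     # illustrative class name, e.g. "headline"; not the dataset's label set
    x: float       # left edge of the bounding box, in pixels
    y: float       # top edge of the bounding box, in pixels
    width: float
    height: float


def group_by_label(regions):
    """Collect regions under their label so each type can be processed separately."""
    grouped = defaultdict(list)
    for region in regions:
        grouped[region.label].append(region)
    return dict(grouped)


def reading_order(regions):
    """Naive top-to-bottom, then left-to-right ordering.

    Real multi-column layouts generally need a more careful strategy.
    """
    return sorted(regions, key=lambda r: (r.y, r.x))


# Made-up regions for a single page, for illustration only.
page = [
    Region("paragraph", 40, 200, 300, 400),
    Region("headline", 40, 50, 600, 80),
    Region("figure", 360, 200, 280, 250),
]

ordered = [r.label for r in reading_order(page)]
print(ordered)
```

A transcription pipeline could then pass each region, in order, to an OCR step chosen by its label. The simple sort key is a deliberate simplification and would not handle multi-column pages correctly.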

Activity Overview

  • Downloads 0
  • Redirect 123
  • Views 798
  • File Size 0

Tags

  • OCR
  • Document Layout Parsing
  • Document Processing

License Control

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)