IndicDLP

IndicDLP is the largest multilingual, multi-layout, multi-domain Document Layout Parsing (DLP) dataset for Indian languages, designed to advance research in document transcription and understanding.

About Dataset

IndicDLP includes 1,22,000 (122,000) human-annotated document images across 11 Indian languages, spanning ten of the most common domains, such as newspapers, magazines, and textbooks. It standardizes complex layouts into structured components, which supports document transcription pipelines and addresses the unique challenges of Indic scripts. The documents are sourced from various publicly available online repositories and include born-digital, scanned, and photographed PDFs, covering a diverse range of formats from the pre-independence era to the present day.