
The OVA Odia Prose Literature Dataset is a curated collection of sentence-level text extracted from 1,143 Odia books digitized by the Odia Virtual Academy (OVA). It spans multiple domains, including prose, culture, autobiographies, biography, travel writing, plays, criticism, short story collections, essays, religion and philosophy, scientific writing, and history. The dataset was developed to support language modeling, NLP research, and generative AI training.
The OVA Odia Prose Literature Dataset is a curated compilation of sentence-level text drawn from 1,143 Odia books digitized by the Odia Virtual Academy (OVA). It has been developed to serve as a structured, machine-learning-ready resource for natural language processing, linguistic research, and generative AI development in Odia. The dataset covers a broad range of domains, including prose, culture, autobiographies, biography, travel writing, plays, criticism, short story collections, essays, religion and philosophy, scientific writing, and history, providing a wide representation of Odia’s literary and intellectual traditions.
The dataset brings together works spanning different periods and writing styles, offering a diverse view of Odia language usage. By extracting content at the sentence level, the dataset aligns with the requirements of modern NLP models that benefit from clean and consistent input units. This structure enables direct use in tasks such as language modeling, translation, summarization, and text generation, as well as analytical tasks that require segmented and standardized text.
The variety of source domains contributes to the richness of linguistic patterns within the dataset. It reflects narrative writing, analytical exposition, reflective prose, conversational text, descriptive passages, historical narration, and technical explanation. This mixture helps models and researchers access a more complete picture of Odia as it appears across literature, scholarship, personal writing, and documentation. The presence of texts from different genres allows the dataset to capture differences in vocabulary, tone, sentence construction, and stylistic form, which is important for building AI systems designed to handle real-world usage rather than narrow subsets of the language.
The purpose of this dataset is to provide sentence-level Odia text, extracted from digitized books and segmented using Odia danda and question-mark delimiters, to support AI training. By transforming each book into a clean CSV file with individual sentences as rows, the dataset enables language modeling, text processing, and other NLP tasks that require structured, high-quality Odia textual data.
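The segmentation described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the dataset, which is not documented here. It assumes sentences end at the Odia danda (U+0964) or an ASCII question mark, and it writes one sentence per row under a hypothetical `sentence` column header. The function names and the sample text are illustrative only and do not come from the dataset.

```python
import csv
import io
import re

# Assumption: a sentence is a run of non-delimiter characters followed by
# the delimiter(s) that end it: the Odia danda (U+0964) or a question mark.
SENTENCE = re.compile(r"[^\u0964?]+[\u0964?]*")


def split_sentences(text):
    """Split Odia text into sentences, keeping the closing delimiter."""
    sentences = []
    for match in SENTENCE.findall(text):
        # Collapse internal whitespace (including line breaks) to single spaces.
        cleaned = " ".join(match.split())
        if cleaned:
            sentences.append(cleaned)
    return sentences


def write_sentences_csv(sentences, handle):
    """Write one sentence per row; the 'sentence' header is hypothetical."""
    writer = csv.writer(handle)
    writer.writerow(["sentence"])
    for sentence in sentences:
        writer.writerow([sentence])


if __name__ == "__main__":
    # Illustrative sample text, not taken from the dataset.
    sample = "ମୁଁ ଭଲ ଅଛି। ତୁମେ କେମିତି ଅଛ? ମୋ ନାମ ରାମ।"
    buffer = io.StringIO()
    write_sentences_csv(split_sentences(sample), buffer)
    print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with `open(path, "w", encoding="utf-8", newline="")` so that the Odia text is stored as UTF-8 and the `csv` module controls line endings.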
Attribution 4.0 International (CC BY 4.0)
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by National e-Governance Division for AIKosh mission.