OVA Odia Literature Dataset v1

This dataset is a curated monolingual corpus of Odia literary texts prepared from books digitized by the Odia Virtual Academy (OVA). It contains sentence-level extractions from multiple books, processed into clean, machine-learning-ready text files.

About Dataset

This dataset is a curated monolingual corpus of Odia literary texts derived from books digitized by the Odia Virtual Academy (OVA). Each digitized book has been processed into a clean, UTF-8 encoded, machine-learning-ready CSV file in which every row represents a single sentence. Sentences are extracted using Odia-specific punctuation rules, splitting primarily at the Odia danda (।) and the question mark (?), which keeps segmentation linguistically consistent. By compiling diverse literary works into a uniform, sentence-level structure, the dataset provides a foundation for tasks such as language modeling, translation, and text classification, as well as broader computational studies of the Odia language.
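
The segmentation described above can be illustrated with a minimal Python sketch. It is not the pipeline actually used to build the dataset: the function names split_sentences and write_sentences_csv and the "sentence" column header are hypothetical, and the sketch assumes that the danda (U+0964) and the question mark are the only sentence terminators, whereas the real rules may handle more cases.

```python
import csv
import re

# Assumption: a sentence ends at the danda (U+0964) or a question mark.
# The dataset's actual rules are described only as "primarily" these two.
_SENTENCE = re.compile("[^\u0964?]+[\u0964?]?")


def split_sentences(text: str) -> list[str]:
    """Split raw text into sentences, keeping the closing punctuation."""
    sentences = []
    for match in _SENTENCE.finditer(text):
        sentence = match.group().strip()
        if sentence:  # skip whitespace-only fragments
            sentences.append(sentence)
    return sentences


def write_sentences_csv(sentences: list[str], handle) -> None:
    """Write one sentence per row to an open text-mode file handle.

    The "sentence" header is a hypothetical column name, not one
    confirmed by the dataset.
    """
    writer = csv.writer(handle)
    writer.writerow(["sentence"])
    for sentence in sentences:
        writer.writerow([sentence])
```

When writing to disk, opening the output with open(path, "w", encoding="utf-8", newline="") keeps the file UTF-8 encoded and avoids blank rows on some platforms.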

Purpose of Dataset

The primary purpose of this dataset is to support the development and evaluation of natural language processing tools and linguistic research for the Odia language, enabling tasks such as language modeling, machine translation, and text classification.