ORGANISATION

Mahanama

Mahānāma is a benchmark for Entity Discovery and Linking (EDL) in the literary domain, derived from the Mahābhārata, the world's longest epic. It marks 109K mentions of 5.5K unique entities, for which detailed descriptions are provided in an English knowledge base. The dataset serves as a unique testbed for addressing challenges of high lexical variation, ambiguous references, and long-range dependencies in morphologically rich Sanskrit.

About Dataset

Mahānāma is a large-scale dataset designed to benchmark Entity Discovery and Linking (EDL) and Coreference Resolution in complex literary environments. Derived from the Mahābhārata, the world’s longest epic, it serves as a unique testbed for models handling high lexical variation, severe ambiguity, and long-range narrative dependencies. The dataset contains over 109K named entity mentions mapped to more than 5.5K unique entities. It is explicitly designed to support cross-lingual research by including a comprehensive Knowledge Base (KB) with entity descriptions in English.

Challenges:

1. Lexical Variation: Characters are referred to by hundreds of different names and epithets; for example, the protagonist Arjuna appears under 126 distinct names.
2. Ambiguity: A single name often refers to multiple distinct figures.
3. Long-range Dependencies: References to the same entity are often spread across a vast context, requiring models to maintain consistency across extended narrative arcs.
4. Morphological Complexity: Extensive inflection, compound formation, and phonetic transformations common in morphologically rich Sanskrit.

Data Organization: The data is organized into 18 volumes, subdivided into chapters and subchapters. Text tokens are encoded in the Sanskrit Library Phonetic Basic Encoding Scheme (SLP1).

* Marked Corpus: The annotated text is provided as JSON data mirroring the standard CoNLL-U (CorefUD) structure.
* Entity Annotations: Stored in the MISC column using the global.Entity key. The format follows the structure of entity ID, entity type, head word, and identity (the canonical name chosen from variants).
* Attributes: Includes the base_name (uninflected form of the matched name) and links to a digitized Index containing the entry number, English gloss, and specific volume and verse citations mapping to the Mahābhārata Calcutta Edition.
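To make the entity annotation layout above more concrete, here is a minimal Python sketch of reading one MISC value. It assumes the CorefUD convention, in which the global.Entity declaration fixes the field order and each mention appears in MISC as `Entity=(...)` with hyphen-separated fields. Whether Mahānāma's JSON serializes mentions in exactly this way is an assumption, as are the sample string, the placement of base_name in MISC, and the helper names `parse_misc` and `parse_entity`. Only single-token mentions of the form `(id-type-head-identity)` are handled.

```python
from typing import Dict, NamedTuple, Optional


class EntityMention(NamedTuple):
    """Fields in the order described for global.Entity."""
    entity_id: str
    entity_type: str
    head: str
    identity: str


def parse_misc(misc: str) -> Dict[str, str]:
    """Split a CoNLL-U style MISC string ('A=1|B=2') into a dict."""
    if misc in ("", "_"):  # '_' marks an empty column in CoNLL-U
        return {}
    fields = {}
    for pair in misc.split("|"):
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields


def parse_entity(value: str) -> Optional[EntityMention]:
    """Parse an assumed single-token mention '(id-type-head-identity)'."""
    if not (value.startswith("(") and value.endswith(")")):
        return None  # multi-token or nested mentions are not handled here
    # Split on the first three hyphens only, so the identity may contain '-'.
    parts = value[1:-1].split("-", 3)
    if len(parts) != 4:
        return None
    return EntityMention(*parts)


# Hypothetical MISC value, invented for illustration (tokens are in SLP1).
sample_misc = "Entity=(e12-person-1-arjuna)|base_name=pArTa"
fields = parse_misc(sample_misc)
mention = parse_entity(fields["Entity"])
print(mention)
print(fields["base_name"])
```

Keeping the MISC split separate from the mention parsing means the second step can be swapped out once the actual serialization in the released files has been checked.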
Knowledge Base (KB): The accompanying Knowledge Base links every entity cluster to an English description, enabling cross-lingual understanding. It is stored as a JSON file in which each entry corresponds to an entity ID. Each entry contains a unique key, the original description from Sørensen's Index, a cleaned version for readability, a list of aliases (name variants), and a cluster head flag indicating whether the entry is the canonical representative.

For details, please see the following paper:

Paper Reference: Sarkar, S., Sarkar, G., Jagadeeshan, M. B., Sandhan, J., Krishna, A., & Goyal, P. (2025). Mahānāma: A Unique Testbed for Literary Entity Discovery and Linking. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Link: https://aclanthology.org/2025.emnlp-main.1269/
DOI: 10.18653/v1/2025.emnlp-main.1269
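The Knowledge Base layout described above can be explored with an alias index that maps each name variant to the entities that carry it; any alias shared by more than one entity is an instance of the ambiguity challenge. The sketch below is illustrative only. The JSON field names (`key`, `description`, `clean_description`, `aliases`, `is_cluster_head`), the entity IDs, and the two toy entries are assumptions and are not copied from the dataset; the aliases are written in SLP1 for consistency with the corpus, which is also an assumption.

```python
import json
from collections import defaultdict
from typing import Dict, List, Set

# Toy KB with hypothetical field names and placeholder descriptions.
KB_JSON = """
{
  "e12": {
    "key": "e12",
    "description": "placeholder for the original Index description",
    "clean_description": "placeholder for the cleaned description",
    "aliases": ["arjuna", "pArTa", "kfzRa"],
    "is_cluster_head": true
  },
  "e40": {
    "key": "e40",
    "description": "placeholder for the original Index description",
    "clean_description": "placeholder for the cleaned description",
    "aliases": ["kfzRa", "vAsudeva"],
    "is_cluster_head": true
  }
}
"""


def build_alias_index(kb: Dict[str, dict]) -> Dict[str, Set[str]]:
    """Map every alias to the set of entity IDs that list it."""
    index = defaultdict(set)
    for entity_id, entry in kb.items():
        for alias in entry["aliases"]:
            index[alias].add(entity_id)
    return dict(index)


def ambiguous_aliases(index: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Keep only aliases shared by more than one entity."""
    return {alias: sorted(ids) for alias, ids in index.items() if len(ids) > 1}


kb = json.loads(KB_JSON)
alias_index = build_alias_index(kb)
print(ambiguous_aliases(alias_index))  # {'kfzRa': ['e12', 'e40']}
```

The same index gives a rough view of lexical variation: the length of each entry's `aliases` list is the number of name variants recorded for that entity.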

Purpose of Dataset

1. To present Mahānāma as a large literary dataset for Entity Discovery and Linking in Sanskrit, a low-resource and morphologically rich language.
2. To capture the core challenges of EDL, namely extreme lexical variation and ambiguity, through a dataset of over 109K annotated mentions mapped to 5.5K entities.
3. To enable cross-lingual linking by providing a knowledge base with entity descriptions in English.
4. To benchmark current entity resolution systems and show that literary texts exhibit substantially higher degrees of variation and ambiguity than existing datasets.
5. To highlight the difficulty of resolving context-dependent names in extended narratives, where current models struggle with mention detection, lexical variation, ambiguity, and long-range dependencies.