**Santham** is a high-quality, curated parallel corpus for Sanskrit-Tamil machine translation. It addresses the lack of parallel data for this language pair by providing over 90,000 parallel training sentence pairs and 3,000 human-reviewed benchmark pairs. The data spans a wide range of Sanskrit literary styles, including modern prose, classical poetry, and epics.
Contains the primary training and benchmark translation pairs.

* **`prose.tsv`**: 20,446 training pairs. Human-translated sentences from the Saṃsādhanī corpus in *unsandhied* (split) form.
* **`prose_benchmark.tsv`**: 1,000 human-reviewed benchmark pairs for evaluation.
* **`poetry.tsv`**: 69,703 training pairs. Automatically aligned / human-translated classical poetry (Mahābhārata, Rāmāyaṇa, Bhagavatam, etc.).
* **`poetry_benchmark.tsv`**: 1,000 human-reviewed benchmark pairs for evaluation.
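As a rough sketch of how these files might be read: the column layout is not stated here, so the example below assumes each row holds a Sanskrit sentence and its Tamil translation in the first two tab-separated columns, with no header row. The helper names `parse_pairs` and `load_pairs` are hypothetical, and the assumed layout should be checked against the actual files before use.

```python
import csv
from typing import Iterable, List, Tuple


def parse_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse tab-separated lines into (source, target) pairs.

    ASSUMPTION: column 0 is Sanskrit and column 1 is Tamil, with no header.
    Rows with fewer than two columns or an empty side are skipped.
    """
    # QUOTE_NONE keeps any quotation marks in the text as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue
        source, target = row[0].strip(), row[1].strip()
        if source and target:
            pairs.append((source, target))
    return pairs


def load_pairs(path: str) -> List[Tuple[str, str]]:
    """Read one of the TSV files (e.g. prose.tsv) from disk as UTF-8."""
    with open(path, encoding="utf-8", newline="") as handle:
        return parse_pairs(handle)
```

With the files downloaded locally, `load_pairs("prose.tsv")` and `load_pairs("prose_benchmark.tsv")` would then give the training and evaluation pairs for the prose subset. Comparing `len(...)` of each result with the counts listed above is a quick check that the assumed layout matches.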
Parallel data for translation, and a benchmark for testing any Sanskrit-Tamil model specifically on poetry and prose text.
Attribution 4.0 International (CC BY 4.0)
4 files