Santham-Parallel

**Santham** is a high-quality, curated parallel corpus for Sanskrit-Tamil machine translation. It addresses the lack of parallel data for this language pair by providing over 90,000 parallel training sentence pairs and 3,000 human-reviewed benchmark pairs. The data spans a wide range of Sanskrit literary styles, including modern prose, classical poetry, and epics.

About Dataset

Contains the primary training and benchmark translation pairs:

* **`prose.tsv`**: 20,446 training pairs. Human-translated sentences from the Saṃsādhanī corpus in *unsandhied* (split) form.
* **`prose_benchmark.tsv`**: 1,000 human-reviewed benchmark pairs for evaluation.
* **`poetry.tsv`**: 69,703 training pairs. Automatically aligned / human-translated classical poetry (Mahābhārata, Rāmāyaṇa, Bhagavatam, etc.).
* **`poetry_benchmark.tsv`**: 1,000 human-reviewed benchmark pairs for evaluation.
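
The files listed above could be read with a small loader along the following lines. This is only a sketch: the column layout is not documented here, so it assumes each row holds the Sanskrit sentence in the first tab-separated column and the Tamil translation in the second, with no header row by default. The function name `load_pairs` and the `has_header` flag are hypothetical and not part of the dataset.

```python
import csv
from typing import List, Tuple


def load_pairs(path: str, has_header: bool = False) -> List[Tuple[str, str]]:
    """Read (source, target) sentence pairs from a tab-separated file.

    Assumption: column 0 is Sanskrit and column 1 is Tamil. Adjust the
    indices if the actual files are laid out differently.
    """
    pairs: List[Tuple[str, str]] = []
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for i, row in enumerate(reader):
            if has_header and i == 0:
                continue  # skip the header row when one is present
            if len(row) < 2:
                continue  # skip blank or malformed lines
            src, tgt = row[0].strip(), row[1].strip()
            if src and tgt:
                pairs.append((src, tgt))
    return pairs
```

For example, `load_pairs("prose.tsv")` would return the training pairs as a list of tuples. If the layout assumption holds, comparing the length of that list with the pair count listed for the file is a quick sanity check.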