Government Of India
Santham-Parallel

**Santham** is a curated, high-quality parallel corpus for Sanskrit-Tamil machine translation. It addresses the scarcity of parallel data for this language pair by providing over 90,000 parallel training sentence pairs and 3,000 human-reviewed benchmark pairs. The data spans a wide range of Sanskrit literary styles, including modern prose, classical poetry, and epics.

About Dataset

Contains the primary training and benchmark translation pairs.

* **`prose.tsv`**: 20,446 training pairs. Human-translated sentences from the Saṃsādhanī corpus in *unsandhied* (split) form.
* **`prose_benchmark.tsv`**: 1,000 human-reviewed benchmark pairs for evaluation.
* **`poetry.tsv`**: 69,703 training pairs. Automatically aligned / human-translated classical poetry (Mahābhārata, Rāmāyaṇa, Bhagavatam, etc.).
* **`poetry_benchmark.tsv`**: 1,000 human-reviewed benchmark pairs for evaluation.
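As a rough illustration of how the `.tsv` files might be read, the sketch below loads one file into a list of sentence pairs. The column layout is not documented on this page, so the code assumes that the first two tab-separated columns hold the Sanskrit and Tamil sides and that a header row is optional. The function name `load_pairs` and the `has_header` flag are hypothetical helpers, not part of the dataset.

```python
import csv
from pathlib import Path


def load_pairs(path, has_header=False):
    """Read a tab-separated file into a list of (source, target) tuples.

    Assumption: the first two columns are the Sanskrit and Tamil sides.
    Check the actual files before relying on this layout.
    """
    pairs = []
    with Path(path).open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for index, row in enumerate(reader):
            if has_header and index == 0:
                continue  # skip the header row when one is present
            if len(row) < 2:
                continue  # skip blank or single-column lines
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs
```

A call such as `load_pairs("santham-parallel/prose.tsv")` would then return the prose training pairs, provided the file sits at that path and follows the assumed layout.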

Purpose of Dataset

Parallel data for training translation systems, and a benchmark for testing any Sanskrit-Tamil model specifically on poetry and prose text.
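One way the benchmark pairs could be used for testing is sketched below. This page does not prescribe a metric, so the example uses a simplified character n-gram F1 score as a stand-in. It is not the standard chrF implementation, and it counts Unicode code points rather than grapheme clusters. The `translate` callable and the `evaluate` helper are hypothetical names introduced only for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_f1(hypothesis, reference, n=2):
    """Simplified character n-gram F1 between two strings."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def evaluate(translate, pairs, n=2):
    """Average char_f1 of translate(source) against each reference."""
    scores = [char_f1(translate(src), tgt, n) for src, tgt in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Here `pairs` would be a list of (Sanskrit, Tamil) tuples taken from a benchmark file, and `translate` any function that maps a Sanskrit string to a Tamil string. For reportable results, an established metric implementation is the safer choice.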

Activity Overview

  • Downloads 0
  • Views 6
  • File Size 11.20 MB

Tags

  • Tamil
  • Parallel Corpus
  • Sanskrit
  • parallel sentences
  • language:tam
  • language:san
  • Sanskrit-Tamil

License Control

Attribution 4.0 International (CC BY 4.0)

santham-parallel (1 directory, 4 files)

Version Control

Version 1 (11.20 MB)
  • Nagaraju V · 4 day(s) ago
    • Folder: santham-parallel