Government Of India
Saamayik-master

Saamayik: A Benchmark and Dataset for English-Sanskrit Translation

About Dataset

We release Saamayik, a dataset of around 53,000 parallel English-Sanskrit sentences written in contemporary prose. Sanskrit is a classical language that remains in use and has a richly documented heritage. However, because little of its content has been digitized, it remains a low-resource language. Existing Sanskrit corpora, whether monolingual or bilingual, have predominantly focused on poetry and offer limited coverage of contemporary written material. Saamayik is curated from a diverse range of domains, including language instruction material, textual teaching pedagogy, and online tutorials, among others. It is a resource that specifically targets the contemporary usage of Sanskrit, with a primary emphasis on prose writing. Translation models trained on our dataset show statistically significant improvements when translating out-of-domain contemporary corpora, outperforming models trained on older classical-era poetry datasets. Finally, we release benchmark models for translating between English and Sanskrit by adapting four multilingual pre-trained models: three that have not previously been exposed to Sanskrit, and one multilingual pre-trained translation model that already covers English and Sanskrit. The dataset and source code can be found at https://github.com/ayushbits/saamayik.
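As a rough illustration of how a parallel corpus like this is typically consumed, the sketch below pairs English and Sanskrit sentences from two line-aligned streams. It is only a sketch under stated assumptions: the on-disk layout of the `data` directory is not described on this page, so the one-sentence-per-line format, the function names `pair_sentences` and `load_parallel`, and any file paths passed to them are hypothetical. The repository's README.md and read_data.py are the authoritative description of the actual format.

```python
from typing import Iterable, List, Tuple


def pair_sentences(
    en_lines: Iterable[str], sa_lines: Iterable[str]
) -> List[Tuple[str, str]]:
    """Zip two line-aligned sentence streams into (English, Sanskrit) pairs.

    Assumption: line i on one side translates line i on the other side.
    Pairs where either side is empty after stripping are dropped.
    Raises ValueError if the two sides have different line counts.
    """
    en = [line.strip() for line in en_lines]
    sa = [line.strip() for line in sa_lines]
    if len(en) != len(sa):
        raise ValueError(
            f"line count mismatch: {len(en)} English vs {len(sa)} Sanskrit"
        )
    return [(e, s) for e, s in zip(en, sa) if e and s]


def load_parallel(en_path: str, sa_path: str) -> List[Tuple[str, str]]:
    """Read two hypothetical line-aligned text files and pair their lines."""
    # Explicit UTF-8 so Devanagari text decodes the same way on every platform.
    with open(en_path, encoding="utf-8") as f_en, \
            open(sa_path, encoding="utf-8") as f_sa:
        return pair_sentences(f_en, f_sa)


if __name__ == "__main__":
    # Tiny in-memory demo; the sentences are illustrative, not from the dataset.
    demo = pair_sentences(
        ["What is your name?\n", "\n"],
        ["भवतः नाम किम्?\n", "\n"],
    )
    print(len(demo), "pair(s) kept")
```

Checking the line counts before pairing is a deliberate choice: a silent off-by-one misalignment in a parallel corpus corrupts every later pair, so failing loudly is usually preferable to truncating with `zip`.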

Activity Overview

  • Downloads 9
  • Views 182
  • File Size 17.18 MB

Tags

  • Sanskrit
  • Machine Learning
  • English-Indic translation
  • english

License Control

CC0 1.0 Public Domain

Saamayik-master (6 files, 5 directories)

  • byt5_scripts (directory, 4 files)
  • data (directory, 7 directories)
  • indicbart_scripts (directory, 4 files)
  • mbart_scripts (directory, 3 files)
  • nllb (directory, 2 files)
  • .gitignore (58 Bytes)
  • calc-metrics.py (400 Bytes)
  • gtranslate.py (1.18 KB)
  • read_data.py (6.23 KB)
  • README.md (text/markdown, 1.86 KB)

This preview shows 10 out of 11 items.

Version Control

Version 1 (17.18 MB)
  • admin · 9 month(s) ago
    • Saamayik-master (folder)
      • byt5_scripts (folder)
      • data (folder)
      • indicbart_scripts (folder)
      • mbart_scripts (folder)
      • nllb (folder)
      • .gitignore
      • calc-metrics.py
      • gtranslate.py
      • read_data.py
      • README.md (text/markdown)
      • 1 more