Varta - a large-scale multilingual and headline-generation dataset

Varta is a large-scale multilingual dataset for headline generation and text generation tasks across Indic languages and English.

About Dataset

Varta is a diverse, challenging, large-scale, multilingual, and high-quality headline-generation dataset containing 41.8 million news articles in 14 Indic languages and English. The languages are Assamese, Bhojpuri, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, and Urdu. It enables research in language understanding, text generation, summarization, and multilingual AI applications.

Citation: Aralikatte, Rahul, Ziling Cheng, Sumanth Doddapaneni, and Jackie Chi Kit Cheung. “Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages.” arXiv preprint arXiv:2305.05858, 2023. Available at https://arxiv.org/abs/2305.05858.

This dataset was identified and facilitated for onboarding as part of the Dataset Onboarding Support Team (DOST) initiative led by CivicDataLab (CDL), partnering with the Gates Foundation in collaboration with BHASHINI. CivicDataLab provided technical support for dataset discovery, validation, metadata preparation, and onboarding facilitation. All dataset ownership and intellectual property rights remain with the original author(s).

Purpose of Dataset

This dataset is designed to support the development of multilingual NLP systems for Indic languages, particularly for headline generation and text summarization tasks. It can be used for training language models, text generation systems, and language understanding applications, and for research involving low-resource Indian languages.

Activity Overview

  • Downloads 0
  • Redirect 1
  • File Size 0
  • Views 13

Tags

  • Odia
  • Marathi
  • Hindi
  • Bhojpuri
  • Assamese
  • urdu
  • telugu
  • tamil
  • malayalam
  • bengali
  • english
  • language:hi
  • language:en
  • language:gu
  • language:ml
  • language:te
  • language:ta
  • language:bn
  • language:mr
  • language:kn
  • language:as
  • language:or
  • language:pa
  • language:ur
  • language:ne
  • license:cc
  • PUNJABI
  • kannada
  • News Domain
  • Multilingual NLP
  • Indic NLP
  • nepali
  • task_categories:feature-extraction
  • task_categories:summarization
  • size_categories:1B<n<10B
  • language:bh
  • arxiv:2305.05858
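
The `language:` tags above use two-letter codes, while the description lists languages by name. The following is a minimal sketch, assuming the pairing shown below between the tag codes and the language names under About Dataset (the page itself does not state this pairing). The record layout and the `lang` field name are hypothetical and used only for illustration; they are not taken from the dataset's documentation.

```python
# Assumed pairing of the language names in the description with the
# `language:` tag codes listed on this page.
LANGUAGE_CODES = {
    "Assamese": "as",
    "Bhojpuri": "bh",
    "Bengali": "bn",
    "English": "en",
    "Gujarati": "gu",
    "Hindi": "hi",
    "Kannada": "kn",
    "Malayalam": "ml",
    "Marathi": "mr",
    "Nepali": "ne",
    "Oriya": "or",
    "Punjabi": "pa",
    "Tamil": "ta",
    "Telugu": "te",
    "Urdu": "ur",
}


def codes_for(names):
    """Return the tag codes for the given language names, in the order given."""
    return [LANGUAGE_CODES[name] for name in names]


def filter_by_language(records, names, key="lang"):
    """Keep records whose language field matches one of the named languages.

    `records` is a list of dicts; `key` is a hypothetical field name.
    """
    wanted = set(codes_for(names))
    return [record for record in records if record.get(key) in wanted]
```

For example, `filter_by_language(records, ["Hindi", "Urdu"])` would keep only the records whose hypothetical `lang` field is `hi` or `ur`.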

License Control

Attribution 4.0 International (CC BY 4.0)