Government Of India
Updesh

Updesh is a large-scale synthetic dataset designed to advance post-training of LLMs for 13 Indian languages.

About Dataset

  • Updesh is a large-scale synthetic dataset designed to advance post-training of LLMs for Indic languages. It integrates translated reasoning data and synthesized open-domain generative content to support culturally grounded multilingual adaptation of LLMs.

  • Despite rapid progress in instruction-tuned LLMs, most existing datasets focus on English. This leaves a gap in high-quality, culturally grounded resources for Indic languages, which are essential for enabling Small Language Models (SLMs) to serve India’s diverse linguistic landscape. Updesh aims to fill this gap by providing rich, multilingual instruction-tuning data grounded in Indian languages and contexts.

  • Unlike previous English-centric translated datasets, Updesh combines culturally grounded data generation with careful, selective translation, preserving linguistic nuance and relevance for each language.

  • Releasing Updesh as open data gives researchers and communities working on Indian languages, as well as other low-resource languages, access to high-quality, culturally nuanced data.

  • Languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Odia, Punjabi, Tamil, Telugu, Urdu

  • Data Composition: ~6.8M translated tuples of reasoning data and ~2.1M synthesized tuples of generative data
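
As a quick check on the composition figures above, the following minimal sketch adds the two approximate counts and computes each component's share. The counts are the rounded values from the listing; the variable names are illustrative only and are not part of the dataset.

```python
# Approximate tuple counts as stated in the dataset description (rounded values).
reasoning_tuples = 6_800_000   # ~6.8M translated reasoning tuples
generative_tuples = 2_100_000  # ~2.1M synthesized generative tuples

total_tuples = reasoning_tuples + generative_tuples
reasoning_share = reasoning_tuples / total_tuples
generative_share = generative_tuples / total_tuples

print(f"Total: ~{total_tuples / 1e6:.1f}M tuples")
print(f"Reasoning share: {reasoning_share:.1%}")
print(f"Generative share: {generative_share:.1%}")
```

Under these rounded figures the dataset holds roughly 8.9M tuples, of which about three quarters are translated reasoning data.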

Activity Overview

  • Downloads 29
  • Views 730
  • File Size 16.21 GB

Tags

  • NLP Dataset
  • Reasoning
  • Instruction-Tuning
  • Indian Language
  • llm
  • fine-tuning

License Control

Microsoft-research-license

Updesh_beta (3 files, 17 directories)


  • analytical_reasoning (directory, 13 files)
  • brain_teaser (directory, 13 files)
  • causal_reasoning (directory, 14 files)
  • creative_writing (directory, 14 files)
  • cultural_multihop_reasoning (directory, 14 files)
  • dialog_gen (directory, 15 files)
  • fermi (directory, 13 files)
  • fs_cot_flow (directory, 13 files)
  • .gitattributes (2.40 KB)
  • LICENSE.md (text/markdown, 10.47 KB)

This preview shows 10 out of 20 items.
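
Because the preview covers only part of the folder, a local copy can be inspected to see the full layout. The sketch below counts the files under each top-level directory of a downloaded copy. It assumes the download has been extracted to a local folder named Updesh_beta (a hypothetical path), and the helper name count_files_per_directory is illustrative. It makes no assumption about the file format inside each directory, and it counts files recursively, which may differ from the portal's own counts if directories are nested.

```python
from pathlib import Path


def count_files_per_directory(root: Path) -> dict[str, int]:
    """Map each immediate subdirectory of `root` to the number of files under it."""
    counts: dict[str, int] = {}
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            # Count regular files at any depth below this subdirectory.
            counts[entry.name] = sum(1 for p in entry.rglob("*") if p.is_file())
    return counts


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the download was extracted.
    root = Path("Updesh_beta")
    if root.is_dir():
        for name, n in count_files_per_directory(root).items():
            print(f"{name}: {n} files")
    else:
        print(f"{root} not found; nothing to count")
```

Top-level files such as LICENSE.md are skipped on purpose, since only directories are reported.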

Version Control

Version 1 (16.21 GB)
  • admin · 7 months ago
    • Updesh_beta
      • analytical_reasoning
      • brain_teaser
      • causal_reasoning
      • creative_writing
      • cultural_multihop_reasoning
      • dialog_gen
      • fermi
      • fs_cot_flow
      • .gitattributes
      • LICENSE.md
      • 10 more