Updesh is a large-scale synthetic dataset designed to advance post-training of LLMs for Indic languages. It integrates translated reasoning data and synthesized open-domain generative content to support culturally-grounded multilingual adaptation of LLMs.
Despite the rapid progress in instruction-tuned LLMs, most existing datasets focus on English, creating a gap in high-quality, culturally grounded resources for Indic languages—resources that are essential for enabling Small Language Models (SLMs) to serve India’s diverse linguistic landscape. Updesh aims to fill this gap by providing rich, multilingual instruction-tuning data grounded in Indian languages and contexts.
Unlike previous English centric translated datasets, Updesh employs a dual approach of culturally-grounded data generation and careful, selective translation, ensuring linguistic nuance and relevance for each language.
By releasing Updesh as open data, researchers and communities working on Indian languages as well as other low-resource languages gain unprecedented access to high-quality, culturally-nuanced data.
Languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Odia, Punjabi, Tamil, Telugu, Urdu
Data Composition: Reasoning Data: ~6.8M translated tuples, Generative Data: ~2.1M synthesized tuples
Microsoft-research-license
13 files
13 files
14 files
14 files
14 files
15 files
13 files
13 files
2.40 KB
10.47 KB
This preview shows 10 out of 20 items. Load more
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by National e-Governance Division for AIKosh mission.