Government Of India

Large-Scale Multilingual Training Datasets for Large Language Models

AI Kosh, in collaboration with Meta and Sarvam AI, is releasing 12 billion synthetic Indic training tokens across 10 Indian languages, along with a fine-tuned 17B Llama-4 MoE model and its fine-tuning pipeline. The initiative strengthens India’s AI training infrastructure and enables scalable multilingual model development across sectors such as science, healthcare, education, and agriculture.

About Use Case

As part of its mission to strengthen India’s AI ecosystem, AI Kosh, in collaboration with Meta and Sarvam AI, is making available 12 billion high-quality synthetic Indic training tokens along with a fine-tuned 17B Llama-4 MoE model and the reproducible fine-tuning pipeline used to train it.

The release includes 14 structured datasets designed for training and fine-tuning large language models in 10 major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu.

The datasets are generated through large-scale synthetic data generation pipelines to augment scarce high-quality training data in low-resource Indian languages. They are structured across three real-world linguistic styles: formal language for academic and official use cases, code-mixed conversational formats such as Hinglish and Tanglish, and transliteration across Roman and native scripts.
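The distinction between native-script, Roman-script, and mixed-script text can be sketched with a simple Unicode block check. This is a minimal illustration only, not part of the released pipeline: the function names and the three style labels are hypothetical, and the heuristic looks only at which scripts appear in a string. It cannot tell romanized Hindi from English, so real code-mix detection would need language identification on top.

```python
# Unicode block ranges for the native scripts of the ten listed languages.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),  # Hindi, Marathi
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),    # Punjabi
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_of(ch):
    """Return the script name for one character, or None if it is not a letter we track."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None  # digits, spaces, punctuation, other scripts


def script_profile(text):
    """Count characters per script in the text."""
    counts = {}
    for ch in text:
        name = script_of(ch)
        if name is not None:
            counts[name] = counts.get(name, 0) + 1
    return counts


def classify_style(text):
    """Hypothetical labels: 'native-script', 'roman-script', 'mixed-script', 'unknown'."""
    counts = script_profile(text)
    if not counts:
        return "unknown"
    latin = counts.get("Latin", 0)
    native = sum(n for name, n in counts.items() if name != "Latin")
    if native and latin:
        return "mixed-script"
    return "native-script" if native else "roman-script"


if __name__ == "__main__":
    for sample in ["नमस्ते", "namaste", "मुझे coffee चाहिए"]:
        print(sample, "->", classify_style(sample))
```

A check like this could serve as a first-pass filter when sorting samples by script before training, with the caveat that the style boundaries used in the released datasets may be defined differently.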

The datasets cover a broad range of high-impact tasks, including reasoning, instruction following, conversational AI, code generation, agentic workflows, multilingual translation, creative writing, and summarization. This breadth enables the development of multilingual AI systems that can operate reliably across domains such as scientific research support, healthcare communication, educational copilots, agricultural advisory systems, citizen services, and enterprise knowledge platforms.

In addition to the datasets, the initiative includes a fine-tuned Llama-4 MoE model with 17B active parameters, 16 experts, and 109B total parameters. Trained on a subset of the released datasets, the model demonstrates measurable improvements in Indic language fluency, instruction following, domain-specific reasoning, and cross-lingual performance.
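The relationship between the active and total parameter counts quoted above can be made concrete with a short calculation. This is a rough sketch using only the two figures stated on this page; the function name is illustrative, and the result says nothing about how tokens are routed to experts.

```python
# Parameter counts as stated in the description above.
ACTIVE_PARAMS = 17e9   # parameters used per token
TOTAL_PARAMS = 109e9   # parameters stored across all experts


def active_fraction(active, total):
    """Share of the model's parameters that participate in processing one token."""
    return active / total


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"Active share per token: {frac:.1%}")  # prints "Active share per token: 15.6%"
```

In a Mixture-of-Experts model, per-token compute generally tracks the active count, while the memory needed to hold the weights tracks the total count.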

The model release includes training configuration details, fine-tuning scripts, preprocessing utilities, evaluation setup, and reproducibility documentation. Together, these assets provide a reference implementation for open, large-scale multilingual model training and lower the barrier for researchers, startups, enterprises, and public institutions to build high-performance AI systems aligned with India’s linguistic diversity.

Tags

  • Indic Languages
  • Multilingual AI
  • LLM Training
  • Foundation Models
  • Synthetic Data
  • Model Fine-Tuning
  • Open-Source AI
  • MoE Model

Sector

Sector Agnostic

Related Datasets

Updated 4 day(s) ago
Creative Tasks
Creative writing and storytelling dataset for imaginative content generation
Tags: creative-writing
  • Upvotes: 0
  • Downloads: 1
  • File Size: 100.33 MB
  • Views: 38

FACEBOOK INDIA ONLINE SERVICES PRIVATE LIMITED

Updated 4 day(s) ago
Crosslingual
Cross-lingual and multilingual translation dataset
Tags: Multilingual, translation, Evaluation
  • Upvotes: 0
  • Downloads: 0
  • File Size: 185.06 MB
  • Views: 25

FACEBOOK INDIA ONLINE SERVICES PRIVATE LIMITED

Updated 4 day(s) ago
Deliberative Alignment
Aligned decision-making dataset using step-by-step deliberation
Tags: decision-making, Reasoning, alignment
  • Upvotes: 0
  • Downloads: 0
  • File Size: 10.32 MB
  • Views: 16

FACEBOOK INDIA ONLINE SERVICES PRIVATE LIMITED

Updated 4 day(s) ago
gsm8k Train
Mathematical reasoning dataset for grade-school problem solving
Tags: problem-solving, Mathematics, education
  • Upvotes: 0
  • Downloads: 0
  • File Size: 18.77 MB
  • Views: 14

FACEBOOK INDIA ONLINE SERVICES PRIVATE LIMITED

Updated 4 day(s) ago
Multiturn Code
Multi-turn coding and software development dataset
Tags: Reasoning, Coding
  • Upvotes: 0
  • Downloads: 0
  • File Size: 321.09 MB
  • Views: 17

FACEBOOK INDIA ONLINE SERVICES PRIVATE LIMITED

Updated 4 day(s) ago
Nemotron GPT oss Reasoning
Advanced reasoning dataset with structured outputs
Tags: problem-solving, advanced-reasoning
  • Upvotes: 0
  • Downloads: 0
  • File Size: 6.59 GB
  • Views: 26

FACEBOOK INDIA ONLINE SERVICES PRIVATE LIMITED

Updated 4 day(s) ago
STEM Code Reasoning
STEM and code reasoning dataset
Tags: Reasoning, stem
  • Upvotes: 0
  • Downloads: 0
  • File Size: 1.02 GB
  • Views: 18

FACEBOOK INDIA ONLINE SERVICES PRIVATE LIMITED

Updated 4 day(s) ago
Teacher Student
Guided reasoning dataset using Socratic questioning
Tags: education, Reasoning
  • Upvotes: 0
  • Downloads: 0
  • File Size: 191.74 MB
  • Views: 32

FACEBOOK INDIA ONLINE SERVICES PRIVATE LIMITED

Updated 4 day(s) ago
Translation Related
Multilingual translation and content generation dataset
Tags: Multilingual, translation
  • Upvotes: 0
  • Downloads: 0
  • File Size: 133.16 MB
  • Views: 22

FACEBOOK INDIA ONLINE SERVICES PRIVATE LIMITED

Updated 4 day(s) ago
Web Summarisation
Web content summarization and information extraction dataset
Tags: Summarization
  • Upvotes: 0
  • Downloads: 0
  • File Size: 28.08 MB
  • Views: 20

FACEBOOK INDIA ONLINE SERVICES PRIVATE LIMITED

Related Models

Meta Llama-4-Scout-17B-16E-Instruct
Llama 4 Scout 17B-16E-Instruct is an instruction-tuned multimodal model with 17B active parameters, developed in collaboration with AI Kosh, Meta, and Sarvam AI. Built on a 16-expert Mixture-of-Experts architecture, it is optimized for text generation and image understanding.
Tags: Multimodal, VLM
  • Upvotes: 0
  • Downloads: 6
  • File Size: 162.05 GB
  • Views: 78
Updated 4 day(s) ago

FACEBOOK INDIA ONLINE SERVICES PRIVATE LIMITED