
EKA Pretraining Indic Corpus v1 — a multilingual Indic dataset under Project EKA for India’s 120B foundation model.
Overview
Dataset name: EKA_PretrainingIndicCorpus_v1
Release: Version 1 — initial public release
Languages: Indic languages and English
Format: CSV with two columns, uri and text
Scale: Multi-billion-token text corpus aggregated until October 2025
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Supported by: IndiaAI Mission, Ministry of Electronics & IT, Government of India
Contributors: Indian Institute of Technology Gandhinagar and Soket AI Labs under Project EKA
About Project EKA
EKA is India’s open-source AI initiative aimed at building a sovereign, large-scale foundation model ecosystem rooted in Indic languages and values.
It represents a national effort to develop AI systems that are open, multilingual, ethical, energy-efficient, and community-driven.
Through EKA, we are constructing a transparent and inclusive foundation model stack — from datasets and benchmarks to training infrastructure — ensuring that India’s AI capabilities are built in India, for India, and for the world.
Key pillars of Project EKA include:
• Developing foundational models across languages and modalities
• Building high-quality, bias-aware datasets representative of India’s linguistic and cultural diversity
• Promoting open research and collaboration across academia and industry
• Advancing energy-efficient and climate-aware AI training
• Ensuring responsible, ethical, and safe AI deployment
About Soket AI
Soket AI is an Indian frontier AI lab focused on building next-generation multilingual foundation models and ethical AI systems.
We design models, datasets, and APIs that enable enterprises, developers, and researchers to build human-aligned intelligence efficiently and responsibly.
Soket’s mission is to ensure that AI made in India reflects India’s diversity, languages, and values while contributing to the global AI commons.
As one of the core builders of Project EKA, Soket AI drives dataset creation, model training, and infrastructure development for large-scale Indic models.
About IIT Gandhinagar
The Indian Institute of Technology Gandhinagar (IITGn) leads the academic and research partnership for Project EKA.
The institute contributes its expertise in data curation, computational linguistics, and large-scale machine learning, ensuring that the corpus meets rigorous research and linguistic standards.
Data Sources
The corpus is aggregated from multiple open and public Indic domains, including:
• National and regional news media
• Educational and research publications
• Government and institutional releases
• Open-web text sources
• Social and cultural materials available in the public domain
• Domain-specific texts from agriculture, health, education, and technology sectors
All content has been collected with the objective of maximising linguistic coverage, topical diversity, and cross-domain representativeness.
Purpose and Use
The EKA Pretraining Indic Corpus serves as the foundational dataset for:
• Pretraining and fine-tuning multilingual Indic large language models (LLMs)
• Developing cross-lingual transfer, translation, and summarization systems
• Creating domain-specific Indic NLP models for education, agriculture, and governance
• Benchmarking and evaluating Indian-language LLMs
• Enabling open and inclusive AI research across academia and startups
This corpus will be used to train the upcoming 120B-parameter EKA Foundation Model, which is designed to represent India’s linguistic and cultural landscape at scale.
Roadmap
This is the first release in a multi-stage dataset roadmap.
Future versions will expand and refine the corpus with:
• Additional Indic languages and regional dialects
• Domain-balanced data distributions
• Improved metadata coverage
• Multimodal (speech and image) extensions
• Benchmarks and evaluation splits for downstream tasks
• Continuous integration with EKA’s model training pipelines
The long-term goal is to build a multi-trillion-token dataset powering India’s open-source foundation models.
Note: A sample of the dataset can be downloaded as a single archive file. The full dataset is distributed as individual CSV files, which are fetched separately.
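Because the corpus ships as many CSV files with uri and text fields, a reader may want to stream records rather than load everything into memory. The following is a minimal sketch, not an official loader. It assumes that each file has a header row naming the columns uri and text, that files are UTF-8 encoded, and that they have already been downloaded into a local directory. The function name iter_records and the directory argument are hypothetical.

```python
import csv
from pathlib import Path
from typing import Iterator, Tuple

# Long documents can exceed the csv module's default per-field limit (128 KiB),
# so raise it to a value that is safe on all platforms.
csv.field_size_limit(2**31 - 1)


def iter_records(data_dir: str) -> Iterator[Tuple[str, str]]:
    """Yield (uri, text) pairs from every CSV file directly under data_dir.

    Assumption: each file has a header row with 'uri' and 'text' columns.
    Rows whose text is empty or whitespace-only are skipped.
    """
    for path in sorted(Path(data_dir).glob("*.csv")):
        # newline="" lets the csv module handle newlines embedded in text fields.
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                text = (row.get("text") or "").strip()
                if text:
                    yield (row.get("uri") or "", text)
```

Calling iter_records on the download directory yields one record at a time, so downstream steps such as deduplication or tokenization can process the corpus file by file without holding it all in memory.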
License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)