EKA Pretraining Indic Corpus v1

EKA Pretraining Indic Corpus v1 — a multilingual Indic dataset under Project EKA for India’s 120B foundation model.

About Dataset

Overview

Dataset name: EKA_PretrainingIndicCorpus_v1

Release: Version 1 — initial public release

Languages: All Indic languages, plus English

Format: CSV files with two columns, uri and text (a loading sketch follows this overview)

Scale: Multi-billion-token text corpus aggregated until October 2025

License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0); see License Control below

Supported by: IndiaAI Mission, Ministry of Electronics & IT, Government of India

Contributors: Indian Institute of Technology Gandhinagar and Soket AI Labs under Project EKA
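
Each part file is a plain CSV with the two columns named above: uri (the source location of a document) and text (its contents). Below is a minimal loading sketch in pandas; the local filename follows the chunk_*_part_*.csv pattern from the Version Control listing, and the presence of a header row is an assumption.

```python
import pandas as pd

# Read one downloaded part file of the corpus.
# Column names (uri, text) come from the dataset overview; the local path is illustrative.
df = pd.read_csv("chunk_0_part_0.csv", usecols=["uri", "text"])

print(len(df), "documents")
print(df.loc[0, "uri"])          # source URI of the first document
print(df.loc[0, "text"][:200])   # first 200 characters of its text
```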


About Project EKA

EKA is India’s open-source AI initiative aimed at building a sovereign, large-scale foundation model ecosystem rooted in Indic languages and values.

It represents a national effort to develop AI systems that are open, multilingual, ethical, energy-efficient, and community-driven.


Through EKA, we are constructing a transparent and inclusive foundation model stack — from datasets and benchmarks to training infrastructure — ensuring that India’s AI capabilities are built in India, for India, and for the world.


Key pillars of Project EKA include:

 Developing foundational models across languages and modalities

 Building high-quality, bias-aware datasets representative of India’s linguistic and cultural diversity

 Promoting open research and collaboration across academia and industry

 Advancing energy-efficient and climate-aware AI training

 Ensuring responsible, ethical, and safe AI deployment


About Soket AI

Soket AI is an Indian frontier AI lab focused on building next-generation multilingual foundation models and ethical AI systems.

We design models, datasets, and APIs that enable enterprises, developers, and researchers to build human-aligned intelligence efficiently and responsibly.

Soket’s mission is to ensure that AI made in India reflects India’s diversity, languages, and values while contributing to the global AI commons.

As one of the core builders of Project EKA, Soket AI drives dataset creation, model training, and infrastructure development for large-scale Indic models.


About IIT Gandhinagar

The Indian Institute of Technology Gandhinagar (IITGn) leads the academic and research partnership for Project EKA.

The institute contributes its expertise in data curation, computational linguistics, and large-scale machine learning, ensuring that the corpus meets rigorous research and linguistic standards.


Data Sources

The corpus is aggregated from multiple open and public Indic domains, including:

 National and regional news media

 Educational and research publications

 Government and institutional releases

 Open-web text sources

 Social and cultural materials available in the public domain

 Domain-specific texts from agriculture, health, education, and technology sectors


All content has been collected with the objective of maximising linguistic coverage, topical diversity, and cross-domain representativeness.


Purpose and Use

The EKA Indic Large Text Corpus serves as the foundational dataset for:

 Pretraining and fine-tuning multilingual Indic large language models (LLMs); a streaming sketch follows below

 Developing cross-lingual transfer, translation, and summarization systems

 Creating domain-specific Indic NLP models for education, agriculture, and governance

 Benchmarking and evaluating Indian-language LLMs

 Enabling open and inclusive AI research across academia and startups


This corpus will power the training of the upcoming 120B-parameter EKA Foundation Model, designed to represent India’s linguistic and cultural landscape at scale.
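
As a sketch of the pretraining use case above: the part files can be streamed document by document, so tokenization or filtering never needs the full corpus in memory. Only the uri/text schema and the chunk_*_part_*.csv naming are taken from this page; the header-row assumption, field-size limit, and statistics loop are illustrative.

```python
import csv
import glob

# Some documents are long; raise the csv module's per-field size limit.
csv.field_size_limit(10**8)

def iter_documents(pattern="chunk_*_part_*.csv"):
    """Yield one document's text at a time from every downloaded part file."""
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                text = (row.get("text") or "").strip()
                if text:
                    yield text

# Example: corpus-level statistics without loading hundreds of gigabytes into memory.
docs = 0
chars = 0
for doc in iter_documents():
    docs += 1
    chars += len(doc)
print(f"{docs} documents, {chars} characters")
```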


Roadmap

This is the first release in a multi-stage dataset roadmap.

Future versions will expand and refine the corpus with:

 Additional Indic languages and regional dialects

 Domain-balanced data distributions

 Improved metadata coverage

 Multimodal (speech and image) extensions

 Benchmarks and evaluation splits for downstream tasks

 Continuous integration with EKA’s model training pipelines


The long-term goal is to build a multi-trillion-token dataset powering India’s open-source foundation models.


Note: A sample of the dataset is available for download as a single archive file. The full dataset can be fetched as the individual CSV part files listed under Version Control below.
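
Given the total size (458.23 GB across the CSV parts listed under Version Control), a chunked read keeps memory bounded when preprocessing any single part file. A minimal pandas sketch; the batch size and local filename are arbitrary choices, not part of the dataset.

```python
import pandas as pd

# Process one part file in 50,000-row batches rather than loading it whole.
total_rows = 0
for batch in pd.read_csv("chunk_0_part_0.csv",
                         usecols=["uri", "text"],
                         chunksize=50_000):
    batch = batch.dropna(subset=["text"])   # skip rows with missing text
    total_rows += len(batch)
    # ...apply cleaning, deduplication, or tokenization to batch["text"] here...
print(f"processed {total_rows} rows")
```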

Activity Overview

  • Downloads: 73
  • Views: 3,060
  • File Size: 458.23 GB

Tags

  • text
  • llm
  • indic
  • pretraining

License Control

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)


Version Control

Version 1 (458.23 GB)
  • Uploaded by admin, 3 months ago
  • chunk_0_part_0.csv (text/csv)
  • chunk_0_part_1.csv (text/csv)
  • chunk_0_part_2.csv (text/csv)
  • chunk_1_part_0.csv (text/csv)
  • chunk_1_part_1.csv (text/csv)
  • chunk_1_part_2.csv (text/csv)
  • chunk_10_part_0.csv (text/csv)
  • chunk_10_part_1.csv (text/csv)
  • chunk_10_part_2.csv (text/csv)
  • chunk_11_part_0.csv (text/csv)
  • … 52 more files

Related Datasets

bhasha‑wiki-Urdu (updated 6 months ago)
Urdu uses the Perso-Arabic script and is spoken in parts of North India and widely in Pakistan.
Tags: NLP Dataset, multi-modal language resources, language research, natural language processing (NLP), indicnlp, multilingual NLP, Urdu
  • Upvotes: 0
  • Downloads: 5
  • File Size: 0
  • Views: 89
Publisher: SOKET LABS TECHNOLOGY AND RESEARCH PRIVATE LIMITED

bhasha‑wiki-Tamil (updated 6 months ago)
Tamil, one of the world’s oldest surviving classical languages, is written in the Tamil script and is spoken widely in Tamil Nadu.
Tags: language research, cross-lingual NLP, multilingual NLP, indicnlp, natural language processing (NLP), language-diversity, multi-modal language resources, Tamil
  • Upvotes: 0
  • Downloads: 8
  • File Size: 0
  • Views: 102
Publisher: SOKET LABS TECHNOLOGY AND RESEARCH PRIVATE LIMITED

bhasha‑wiki-Hindi (updated 6 months ago)
Hindi is written in the Devanagari script and is the most widely spoken language in India.
Tags: cross-lingual NLP, multilingual NLP, indicnlp, natural language processing (NLP), Hindi, language research
  • Upvotes: 0
  • Downloads: 24
  • File Size: 0
  • Views: 146
Publisher: SOKET LABS TECHNOLOGY AND RESEARCH PRIVATE LIMITED

bhasha‑wiki-Gujrati (updated 6 months ago)
Gujarati is written in the Gujarati script and is widely spoken in the western Indian state of Gujarat.
Tags: Gujarati, multilingual NLP, indicnlp, natural language processing (NLP), language-diversity, language
  • Upvotes: 2
  • Downloads: 10
  • File Size: 0
  • Views: 200
Publisher: SOKET LABS TECHNOLOGY AND RESEARCH PRIVATE LIMITED

bhasha‑wiki-English (updated 6 months ago)
English provides a rich and diverse base of Wikipedia content.
Tags: language research, English, cross-lingual NLP, multilingual NLP, indicnlp, natural language processing (NLP), language
  • Upvotes: 0
  • Downloads: 16
  • File Size: 0
  • Views: 94
Publisher: SOKET LABS TECHNOLOGY AND RESEARCH PRIVATE LIMITED

bhasha‑wiki-Bengali (updated 6 months ago)
Bengali, or Bangla, uses the Bengali script and is predominantly spoken in West Bengal, Tripura, and Bangladesh.
Tags: natural language processing (NLP), multi-modal language resources, Bengali, indicnlp, multilingual corpus
  • Upvotes: 0
  • Downloads: 11
  • File Size: 0
  • Views: 152
Publisher: SOKET LABS TECHNOLOGY AND RESEARCH PRIVATE LIMITED

bhasha‑wiki-Kannada (updated 6 months ago)
Kannada is written in the Kannada script and is spoken in the southern Indian state of Karnataka.
Tags: multilingual NLP, Kannada, multi-modal language resources, language research, natural language processing (NLP), indicnlp, cross-lingual NLP
  • Upvotes: 0
  • Downloads: 8
  • File Size: 0
  • Views: 102
Publisher: SOKET LABS TECHNOLOGY AND RESEARCH PRIVATE LIMITED