ORGANISATION

RedPajama-Data

Open-source replication of the LLaMA training set.

About Dataset

RedPajama-Data is an open-source dataset created as a transparent replication of the data mixture used to train large-scale language models such as LLaMA. It aggregates text from multiple public and openly accessible sources, including web pages, code repositories, books, academic papers, and community discussions. The dataset is documented to support reproducibility and transparency in large language model training. RedPajama-Data emphasizes openness by clearly identifying data sources and collection processes, enabling researchers to understand the composition of training data at scale.

Purpose of Dataset

RedPajama-Data is designed to support the open development and reproducibility of large language models. Its primary use is for training and pretraining LLMs in academic, research, and open-source settings. By providing a well-documented alternative to proprietary training datasets, it enables researchers to study scaling behavior, data mixture effects, and model performance. The dataset is also useful for benchmarking and comparing open models against closed systems, fostering transparency and responsible AI research.