A collection of research papers from arXiv.org, beneficial for training models on scientific literature and technical writing.
The arXiv Dataset is a large collection of scientific papers and metadata sourced from arXiv.org, a repository operated by Cornell University. It includes titles, abstracts, authorship information, subject categories, and in some cases full text. The dataset spans multiple scientific disciplines such as physics, mathematics, computer science, and engineering, providing structured and technical academic content.
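As a rough illustration of how the metadata described above might be handled, the sketch below parses a few records and filters them by subject category. It assumes a JSON Lines layout with the hypothetical field names `title`, `abstract`, `authors`, and `categories` (a space-separated string); the actual file format and field names are not stated here and should be checked against the downloaded files. The sample records are invented placeholders, not real papers.

```python
import json
from collections import Counter

# Invented placeholder records. The JSON Lines layout and the field names
# ("title", "abstract", "authors", "categories") are assumptions, not
# something the dataset description confirms.
SAMPLE_JSONL = """\
{"title": "Placeholder paper one", "abstract": "Placeholder abstract.", "authors": "A. Author", "categories": "math.CO cs.DM"}
{"title": "Placeholder paper two", "abstract": "Placeholder abstract.", "authors": "B. Author", "categories": "cs.CL"}
{"title": "Placeholder paper three", "abstract": "Placeholder abstract.", "authors": "C. Author", "categories": "hep-th"}
"""


def parse_records(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def filter_by_category_prefix(records, prefix):
    """Keep records where any listed category starts with the prefix."""
    return [
        record
        for record in records
        if any(cat.startswith(prefix) for cat in record["categories"].split())
    ]


def count_primary_categories(records):
    """Count records by the first listed category (assumed to be primary)."""
    return Counter(record["categories"].split()[0] for record in records)


records = parse_records(SAMPLE_JSONL)
cs_records = filter_by_category_prefix(records, "cs.")
primary_counts = count_primary_categories(records)
```

For a real download, the same functions could be applied to the file's contents read line by line instead of to the inline sample string.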
The arXiv Dataset is widely used for training and evaluating language models on scientific and technical text. It supports tasks such as scientific summarization, citation analysis, topic classification, and technical question answering. Researchers also use it to study scholarly communication and domain-specific language understanding. For LLMs, it helps improve reasoning, technical vocabulary comprehension, and performance on research-oriented tasks.
License: Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)