IndicMSMARCO

A comprehensive multilingual variant of MS MARCO for Indian languages, featuring select queries and corresponding passages with high-quality translations.

About Dataset

📊 Dataset Overview

  • Total Samples: 12,999
  • Languages: 13
  • Source: MS MARCO development set
  • Quality: Human-verified translations
  • Task: Information Retrieval / Passage Ranking
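Passage ranking on MS MARCO is conventionally scored with MRR@10 (mean reciprocal rank over the top 10 results). The following is a minimal sketch of that metric, not something taken from this dataset card: the function name `mrr_at_k` and the input shapes (a dict of ranked passage IDs per query and a dict of relevant passage IDs per query) are assumptions made for illustration and do not reflect the dataset's actual schema.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank at cutoff k.

    rankings: dict mapping query id -> list of passage ids, best first
              (hypothetical input shape, not the dataset schema).
    relevant: dict mapping query id -> set of relevant passage ids.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked in rankings.items():
        gold = relevant.get(query_id, set())
        # Only the first relevant hit within the top k contributes.
        for rank, passage_id in enumerate(ranked[:k], start=1):
            if passage_id in gold:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

A query whose first relevant passage appears at rank 2 contributes 0.5; a query with no relevant passage in the top k contributes 0.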

🎯 Key Features

  • Topic Diversity: Science, history, politics, health, technology
  • Query Complexity: Simple factual, descriptive, and complex entity-based queries
  • Balanced Representation: Short, medium, and long-form queries
  • High-Quality Translations: Professional translation and verification
  • Consistent Structure: Normalized schema across all languages

📋 Available Languages (13 total)

Code  Language   Load Command                                  Sample Count
as    Assamese   load_dataset('ai4bharat/IndicMSMARCO', 'as')  ~999
bn    Bengali    load_dataset('ai4bharat/IndicMSMARCO', 'bn')  ~999
gu    Gujarati   load_dataset('ai4bharat/IndicMSMARCO', 'gu')  ~999
hi    Hindi      load_dataset('ai4bharat/IndicMSMARCO', 'hi')  ~999
kn    Kannada    load_dataset('ai4bharat/IndicMSMARCO', 'kn')  ~999
ml    Malayalam  load_dataset('ai4bharat/IndicMSMARCO', 'ml')  ~999
mr    Marathi    load_dataset('ai4bharat/IndicMSMARCO', 'mr')  ~999
ne    Nepali     load_dataset('ai4bharat/IndicMSMARCO', 'ne')  ~999
or    Odia       load_dataset('ai4bharat/IndicMSMARCO', 'or')  ~999
pa    Punjabi    load_dataset('ai4bharat/IndicMSMARCO', 'pa')  ~999
ta    Tamil      load_dataset('ai4bharat/IndicMSMARCO', 'ta')  ~999
te    Telugu     load_dataset('ai4bharat/IndicMSMARCO', 'te')  ~999
ur    Urdu       load_dataset('ai4bharat/IndicMSMARCO', 'ur')  ~999
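The load commands in the table all follow one pattern, so they can be wrapped in a small helper. This is a sketch under stated assumptions: it presumes the Hugging Face `datasets` library is installed and that the repository and config names are exactly as listed above. The helper name `load_language` and its `loader` parameter are hypothetical, added here only so the function can be exercised without network access.

```python
# Language codes and names as listed in the table above.
LANGUAGES = {
    "as": "Assamese", "bn": "Bengali", "gu": "Gujarati", "hi": "Hindi",
    "kn": "Kannada", "ml": "Malayalam", "mr": "Marathi", "ne": "Nepali",
    "or": "Odia", "pa": "Punjabi", "ta": "Tamil", "te": "Telugu",
    "ur": "Urdu",
}

REPO_ID = "ai4bharat/IndicMSMARCO"


def load_language(code, loader=None):
    """Load one language configuration of the dataset.

    `loader` is a hypothetical injection point; by default it is
    `datasets.load_dataset`, imported lazily so this module can be
    imported even where the `datasets` package is absent.
    """
    if code not in LANGUAGES:
        raise ValueError(f"Unknown language code: {code!r}")
    if loader is None:
        from datasets import load_dataset as loader
    return loader(REPO_ID, code)
```

For example, `load_language("hi")` would issue the same call as the Hindi row of the table; the split names and column names of the returned object are not documented here and should be inspected after loading.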

Activity Overview

  • Downloads: 0
  • Redirects: 4
  • Views: 27
  • File Size: 0

Tags

  • Multilingual
  • Indian Languages
  • rag
  • msmarco
  • indic
  • retrieval

License Control

MIT