Bharat Parallel Corpus Collection (BPCC)

The Bharat Parallel Corpus Collection (BPCC) is a large-scale parallel corpus for machine translation across 22 Indian languages, developed by AI4Bharat.

About Dataset

The Bharat Parallel Corpus Collection (BPCC), developed by AI4Bharat at IIT Madras, is a parallel corpus aimed at improving machine translation for all 22 scheduled Indian languages. It contains approximately 230 million sentence pairs, combining mined data from existing corpora with human-curated, high-quality datasets. BPCC supports multilingual machine translation models such as IndicTrans2 and provides evaluation benchmarks for translation quality across diverse domains.

Activity Overview

  • Downloads 0
  • Redirect 41
  • Views 185
  • File Size 0

Tags

  • Translation
  • Multilingual
  • NLP
  • Machine Translation
  • Indian Language
  • Indic language

License Control

CC0 1.0 Public Domain