PHINC

Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation

About Dataset

PHINC (Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation): The dataset addresses the challenges of translating noisy, informal, code-mixed social media text. It offers 13,738 Hinglish-English sentence pairs, manually annotated by 54 annotators, for the low-resource machine translation task.

The dataset contains the following fields:

- Hinglish Code-Mixed Sentence: The original sentence in Romanized Hindi-English (Hinglish).
- Human Translated English Sentence: The corresponding English translation provided by human annotators.
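
The two fields above can be read into sentence pairs with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes the CSV header row uses the field names exactly as listed above (the actual headers in PHINC.csv may differ), the helper name load_pairs is hypothetical, and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Assumed column headers, copied from the field names listed above.
# The real header row of PHINC.csv may differ; adjust if needed.
SRC_COL = "Hinglish Code-Mixed Sentence"
TGT_COL = "Human Translated English Sentence"


def load_pairs(fileobj, src_col=SRC_COL, tgt_col=TGT_COL):
    """Read (Hinglish, English) sentence pairs from an open CSV file object."""
    reader = csv.DictReader(fileobj)
    pairs = []
    for row in reader:
        src = (row.get(src_col) or "").strip()
        tgt = (row.get(tgt_col) or "").strip()
        # Keep only rows where both the source and the translation are present.
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Made-up sample standing in for the real file (not actual PHINC data).
# The second row has no translation and is therefore skipped.
sample_csv = (
    "Hinglish Code-Mixed Sentence,Human Translated English Sentence\n"
    "aaj ka match bahut accha tha,Today's match was very good\n"
    "kal milte hain,\n"
)

pairs = load_pairs(io.StringIO(sample_csv))
print(pairs)
```

To read the downloaded file instead of the in-memory sample, pass an open file object, for example open("PHINC.csv", newline="", encoding="utf-8"); the UTF-8 encoding is also an assumption.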

Dataset Description:

- Curated by: Lingo Research Group at IIT Gandhinagar
- Language(s) (NLP): Bilingual (Hindi [hi], English [en])
- Licensed by: cc-by-4.0

Citation:

If you use this dataset, please cite the following work:
@inproceedings{srivastava-singh-2020-phinc,
    title = "{PHINC}: A Parallel {H}inglish Social Media Code-Mixed Corpus for Machine Translation",
    author = "Srivastava, Vivek  and
      Singh, Mayank",
    booktitle = "Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wnut-1.7/",
    doi = "10.18653/v1/2020.wnut-1.7",
    pages = "41--49"
}

Activity Overview

  • Downloads 10
  • Views 98
  • File Size 2.03 MB

Tags

  • Code-Mixing
  • Data-annotation
  • Hinglish

License Control

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Version Control

Version 1 (2.03 MB)
  • admin · 8 months ago
    • text/csv
      PHINC.csv