PHINC

Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation

About Dataset

PHINC (Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation) addresses the challenges of translating noisy, informal, code-mixed social media text. It offers 13,738 Hinglish-English sentence pairs, manually annotated by 54 annotators, for the low-resource machine translation task.

The dataset contains the following fields:

Hinglish Code-Mixed Sentence: The original sentence in Romanized Hindi-English (Hinglish).

Human Translated English Sentence: The corresponding English translation provided by human annotators.
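
The two fields above map naturally onto (source, target) pairs for machine translation. The following is a minimal sketch of reading them from a CSV file, not an official loader. It assumes the column headers match the field names listed above, which may not be true of the distributed file, so they are exposed as parameters. The helper name load_pairs and the sample row are hypothetical; the sample sentence is made up for illustration and is not taken from the corpus.

```python
import csv
import io

# Assumed header names, copied from the field list above.
# The actual file may use different headers; pass them explicitly if so.
SRC_FIELD = "Hinglish Code-Mixed Sentence"
TGT_FIELD = "Human Translated English Sentence"


def load_pairs(lines, src_field=SRC_FIELD, tgt_field=TGT_FIELD):
    """Return a list of (hinglish, english) tuples from CSV text.

    `lines` is any iterable of CSV lines, e.g. an open file object.
    Rows with an empty source or target are skipped.
    """
    reader = csv.DictReader(lines)
    pairs = []
    for row in reader:
        src = (row.get(src_field) or "").strip()
        tgt = (row.get(tgt_field) or "").strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Made-up rows for illustration only; not taken from the corpus.
sample = io.StringIO(
    "Hinglish Code-Mixed Sentence,Human Translated English Sentence\n"
    "yeh movie bahut acchi hai,This movie is very good\n"
    ",row with a missing source sentence\n"
)
pairs = load_pairs(sample)
print(pairs)
```

With a real file, the same helper could be called as load_pairs(open(path, newline="", encoding="utf-8")), where path is wherever the CSV was downloaded to.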

Dataset Description:

- Curated by: Lingo Research Group at IIT Gandhinagar

- Language(s) (NLP): Bilingual (Hindi [hi], English [en])

- License: CC BY 4.0

Citation:

If you use this dataset, please cite the following work:

@inproceedings{srivastava-singh-2020-phinc,
    title = "{PHINC}: A Parallel {H}inglish Social Media Code-Mixed Corpus for Machine Translation",
    author = "Srivastava, Vivek  and
      Singh, Mayank",
    booktitle = "Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wnut-1.7/",
    doi = "10.18653/v1/2020.wnut-1.7",
    pages = "41--49"
}