HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text
About Dataset
HinGE is a high-quality Hindi-English code-mixed dataset for natural language generation (NLG) tasks, manually annotated by five annotators.
The dataset contains the following columns:
A. English, Hindi: The parallel source sentences from the IITB English-Hindi parallel corpus.
B. Human-generated Hinglish: A list of Hinglish sentences generated by the human annotators.
C. WAC: Hinglish sentence generated by the WAC algorithm (see paper for more details).
D. WAC rating1, WAC rating2: Quality ratings assigned to the Hinglish sentence generated by the WAC algorithm. Each rating ranges from 1 to 10.
E. PAC: Hinglish sentence generated by the PAC algorithm (see paper for more details).
F. PAC rating1, PAC rating2: Quality ratings assigned to the Hinglish sentence generated by the PAC algorithm. Each rating ranges from 1 to 10.
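The column layout above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes the data is available as a CSV whose header matches the column names listed above, and it uses a tiny in-memory sample with placeholder text and made-up rating values in place of the real file. The helper name `mean_ratings` is hypothetical.

```python
import csv
import io
from statistics import mean

# Toy stand-in for the real file. The header follows the column list above
# (assumed to be the exact CSV header); the rows and ratings are made up.
SAMPLE_CSV = """English,Hindi,Human-generated Hinglish,WAC,WAC rating1,WAC rating2,PAC,PAC rating1,PAC rating2
<english 1>,<hindi 1>,<hinglish list 1>,<wac 1>,6,8,<pac 1>,7,9
<english 2>,<hindi 2>,<hinglish list 2>,<wac 2>,4,6,<pac 2>,5,5
"""


def mean_ratings(rows, algorithm):
    """Average the two quality ratings of each row for "WAC" or "PAC"."""
    return [
        mean([float(row[f"{algorithm} rating1"]), float(row[f"{algorithm} rating2"])])
        for row in rows
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
wac_scores = mean_ratings(rows, "WAC")  # one averaged score per sentence
pac_scores = mean_ratings(rows, "PAC")

print("WAC mean rating:", mean(wac_scores))
print("PAC mean rating:", mean(pac_scores))
```

To run this on the actual data, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and adjust the column names if the distributed file spells them differently.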
Dataset Description:
- Curated by: Lingo Research Group at IIT Gandhinagar
- Language(s) (NLP): Bilingual (Hindi [hi], English [en])
- License: CC BY 4.0
Citation:
If you use this dataset, please cite the following work:
@inproceedings{srivastava-singh-2021-hinge,
title = "{H}in{GE}: A Dataset for Generation and Evaluation of Code-Mixed {H}inglish Text",
author = "Srivastava, Vivek and
Singh, Mayank",
booktitle = "Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",