MUTANT is a high-quality Hindi-English (Hinglish) code-mixed dataset designed for multi-sentential text processing tasks, with a particular focus on summarization and evaluation.
Data Sources:
The MUTANT dataset comprises long code-mixed texts extracted from two main sources:
1. Political Speeches & Press Releases: Collected from government portals and political party websites.
2. Hindi News Articles: Extracted from leading Hindi news websites, ensuring high-quality and formal Hinglish text.
The final dataset contains multiple documents, each of which contains at least one multi-sentential code-mixed text (MCT). In total, it includes 67,007 documents with 84,937 MCTs. A significant portion of the documents (44,913) belongs to the Dainik Jagran dataset.
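To put these counts in proportion, the following minimal sketch derives two summary figures from the numbers quoted above: the share of documents that come from Dainik Jagran and the average number of MCTs per document. The constant and function names are illustrative assumptions, not part of the dataset or any official tooling.

```python
# Counts quoted in the dataset description; names are hypothetical.
TOTAL_DOCUMENTS = 67007
TOTAL_MCTS = 84937
DAINIK_JAGRAN_DOCUMENTS = 44913


def summarize(total_docs: int, total_mcts: int, source_docs: int) -> tuple[float, float]:
    """Return (share of documents from one source, mean MCTs per document)."""
    share = source_docs / total_docs
    mean_mcts = total_mcts / total_docs
    return share, mean_mcts


share, mean_mcts = summarize(TOTAL_DOCUMENTS, TOTAL_MCTS, DAINIK_JAGRAN_DOCUMENTS)
print(f"Dainik Jagran share of documents: {share:.1%}")
print(f"Average MCTs per document: {mean_mcts:.2f}")
```

By this arithmetic, roughly two thirds of the documents come from Dainik Jagran, and a document contains a little over one MCT on average.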
Dataset Attribution & Licensing
@inproceedings{gupta-etal-2023-mutant,
title = "{MUTANT}: A Multi-sentential Code-mixed {H}inglish Dataset",
author = "Gupta, Rahul and
Srivastava, Vivek and
Singh, Mayank",
editor = "Vlachos, Andreas and
Augenstein, Isabelle",
booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-eacl.56/",
doi = "10.18653/v1/2023.findings-eacl.56",
pages = "744--753"
}
License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)