MUTANT

A Multi-sentential Code-mixed Hinglish Dataset

About Dataset

MUTANT is a high-quality Hindi-English (Hinglish) code-mixed dataset for multi-sentential text processing, with a particular focus on summarization and evaluation.

Data Sources:

The MUTANT dataset comprises long code-mixed texts extracted from two main sources:

1. Political Speeches & Press Releases: Collected from government portals and political party websites.

2. Hindi News Articles: Extracted from leading Hindi news websites, ensuring high-quality and formal Hinglish text.

The final dataset consists of documents that each contain at least one multi-sentential code-mixed text (MCT). In total, it includes 67,007 documents with 84,937 MCTs. A significant portion of the documents (44,913) belongs to the Dainik Jagran dataset.
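The counts above can be turned into a few summary ratios. The sketch below uses only the three figures stated in this description; the variable names are illustrative and are not part of the dataset.

```python
# Figures quoted in the dataset description.
TOTAL_DOCUMENTS = 67007
TOTAL_MCTS = 84937
DAINIK_JAGRAN_DOCUMENTS = 44913

# Fraction of all documents that come from the Dainik Jagran dataset.
dainik_jagran_share = DAINIK_JAGRAN_DOCUMENTS / TOTAL_DOCUMENTS

# Documents that come from the remaining sources.
other_documents = TOTAL_DOCUMENTS - DAINIK_JAGRAN_DOCUMENTS

# Average number of MCTs per document (each document has at least one).
mcts_per_document = TOTAL_MCTS / TOTAL_DOCUMENTS

print(f"Dainik Jagran share of documents: {dainik_jagran_share:.1%}")
print(f"Documents from other sources: {other_documents}")
print(f"Average MCTs per document: {mcts_per_document:.2f}")
```

Roughly two thirds of the documents come from Dainik Jagran, so any per-source analysis may want to account for that imbalance.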

Dataset Attribution & Licensing

Curated by: Lingo Research Group at IIT Gandhinagar
Language(s): Bilingual (Hindi [hi], English [en])
License: cc-by-4.0

Citation:

If you use this dataset, please cite the following work:

@inproceedings{gupta-etal-2023-mutant,
    title = "{MUTANT}: A Multi-sentential Code-mixed {H}inglish Dataset",
    author = "Gupta, Rahul  and
      Srivastava, Vivek  and
      Singh, Mayank",
    editor = "Vlachos, Andreas  and
      Augenstein, Isabelle",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.56/",
    doi = "10.18653/v1/2023.findings-eacl.56",
    pages = "744--753"
}

Dataset Metadata

License

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Geographical coverage

Global

Sector

Science, Technology and Research

Author

Rahul Gupta, Vivek Srivastava, Mayank Singh

Source Organisation

IITGN

Uploaded by

Lingo Research Group

Data Quality Score (Beta)

4.75

Dataset type

Structured

Frequency

Daily

Time Granularity

Year range

N.A.

Date & Time

17/07/25 05:47:24

Visibility

Open

Hosted / Redirected

Hosted

Activity Overview

0
56
1.51 MB
179

Version Control

Version 2(1.51 MB)

admin·11 month(s) ago
- inc_62.txt
- inc_76.txt
- inc_89.txt
- Part1_business_1156.txt
- Part1_business_1181.txt
- Part1_business_1618.txt
- Part1_business_1803.txt
- Part1_business_1817.txt
- Part1_business_2463.txt
- Part1_business_2477.txt
- 556 more
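Since the release is a flat set of plain-text files, a first step is often to group them by filename prefix and read them into memory. The following is a minimal sketch under stated assumptions: the directory path passed to `load_documents` is hypothetical, the files are assumed to be UTF-8 encoded, and the prefix before the trailing numeric ID is assumed to identify a file group. None of these is confirmed by the dataset description.

```python
import re
from collections import defaultdict
from pathlib import Path


def source_prefix(filename: str) -> str:
    """Strip the extension and the trailing numeric ID, e.g. 'inc_62.txt' -> 'inc'."""
    stem = Path(filename).stem
    return re.sub(r"_\d+$", "", stem)


def group_by_prefix(filenames):
    """Map each prefix to the list of filenames that share it."""
    groups = defaultdict(list)
    for name in filenames:
        groups[source_prefix(name)].append(name)
    return dict(groups)


def load_documents(directory):
    """Read every .txt file in a local copy of the dataset (UTF-8 is an assumption)."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(directory).glob("*.txt"))
    }


# The ten filenames shown in the Version 2 listing.
listed = [
    "inc_62.txt", "inc_76.txt", "inc_89.txt",
    "Part1_business_1156.txt", "Part1_business_1181.txt",
    "Part1_business_1618.txt", "Part1_business_1803.txt",
    "Part1_business_1817.txt", "Part1_business_2463.txt",
    "Part1_business_2477.txt",
]
groups = group_by_prefix(listed)
for prefix, names in groups.items():
    print(prefix, len(names))
```

With a local download, `load_documents("path/to/mutant")` would return a mapping from filename to text; the path here is a placeholder.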

Version 1(169.08 MB)

admin·11 month(s) ago

No File(s) Found!
