
PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs
Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias towards a limited set of programming concepts, while most other concepts are neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate estimates of model performance. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts with a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs. The code and dataset are openly available to the NLP community at https://github.com/PythonSaga/PythonSaga.
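A minimal sketch (not the authors' evaluation harness) of how benchmarks of this kind are typically scored: load one prompt per task and compute the standard unbiased pass@k estimator used by HumanEval-style evaluations. The file name and JSON field names ("task_id", "prompt") are assumptions for illustration only.

```python
# Sketch: reading benchmark prompts and computing the unbiased pass@k
# estimator. Field names and file layout are assumptions, not the
# PythonSaga release format.
import json
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations (c of which passed the tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def load_prompts(path: str):
    """Yield (task_id, prompt) pairs from a JSON-lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            yield record["task_id"], record["prompt"]


if __name__ == "__main__":
    # Example: 10 generations per task, 3 of which passed the unit tests.
    print(f"pass@1  = {pass_at_k(n=10, c=3, k=1):.3f}")   # 0.300
    print(f"pass@10 = {pass_at_k(n=10, c=3, k=10):.3f}")  # 1.000
```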
@inproceedings{yadav-etal-2024-pythonsaga,
    title = "{P}ython{S}aga: Redefining the Benchmark to Evaluate Code Generating {LLM}s",
    author = "Yadav, Ankit and
      Beniwal, Himanshu and
      Singh, Mayank",
    editor = "Al-Onaizan, Yaser and
      Bansal, Mohit and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.996/",
    doi = "10.18653/v1/2024.findings-emnlp.996",
    pages = "17113--17126"
}

License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)