New Benchmarks Evaluate Self-Calling Code Generation in Large Language Models

The development and evaluation of large language models (LLMs) is progressing rapidly, and a key aspect of this progress is the ability of these models to generate functionally correct code. Established benchmarks such as HumanEval and MBPP have effectively tested the code generation capabilities of LLMs, but they focus mainly on generating standalone code that is executed directly. New research now introduces two extended benchmarks, HumanEval Pro and MBPP Pro, which evaluate the ability of LLMs to generate self-calling code: code that defines functions and then calls those functions within the same generated solution.

This capability is crucial for more complex programming tasks, as it promotes the modularity and reusability of code. HumanEval Pro and MBPP Pro are based on the existing HumanEval and MBPP benchmarks and extend them with test cases that explicitly require the generation of self-calling code. The new benchmarks thus offer a more nuanced evaluation of the code generation capabilities of LLMs.
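To make the idea concrete, here is a minimal sketch of what a self-calling task might look like in Python; the problem, function names, and test are illustrative assumptions and are not drawn from the actual benchmarks.

```python
# Illustrative self-calling task (hypothetical, not from HumanEval Pro or MBPP Pro).

def sum_of_squares(numbers):
    """Base problem: return the sum of the squares of the given numbers."""
    return sum(n * n for n in numbers)

def sum_of_squares_per_row(matrix):
    """Extended problem: solve each row by calling the base function,
    rather than re-implementing its logic inline."""
    return [sum_of_squares(row) for row in matrix]

# The extended function is only correct if the base function is both
# defined correctly and called correctly.
assert sum_of_squares_per_row([[1, 2], [3, 4]]) == [5, 25]
```

The point of such tasks is that the second function is judged on whether it correctly reuses the first, which is exactly the modularity and reusability the new benchmarks aim to probe.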

The Challenge of Self-Calling Code

Generating self-calling code presents new challenges for LLMs. Models must not only master the syntax and semantics of the programming language but also understand the context and dependencies within the code they generate. They must define functions correctly and then call them with the correct arguments to achieve the desired functionality. This requires a deeper understanding of programming logic and the ability to keep track of how the generated components depend on one another.
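As a hypothetical illustration of such a dependency (the functions below are assumptions, not benchmark problems): the calling function must respect the helper's signature and return type, otherwise the test cases fail even though each function looks plausible in isolation.

```python
def filter_even(numbers):
    """Helper: return only the even values from the input list."""
    return [n for n in numbers if n % 2 == 0]

def count_even(numbers):
    """Count the even values by reusing filter_even.

    A model that forgets that filter_even returns a list (for example by
    writing `return filter_even(numbers)` where an integer is expected)
    would fail the tests; the correct call takes the length of the result.
    """
    return len(filter_even(numbers))

assert count_even([1, 2, 3, 4]) == 2
```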

HumanEval Pro and MBPP Pro: Structure and Functionality

HumanEval Pro extends the existing HumanEval benchmark, which consists of 164 handwritten Python programming tasks. The new test cases in HumanEval Pro examine whether LLMs can define functions and then call them within the generated code. MBPP Pro is based on the MBPP benchmark, which consists of roughly 1,000 crowd-sourced Python programming problems aimed at entry-level programmers. Like HumanEval Pro, MBPP Pro contains additional test cases that require the generation of self-calling code. Both benchmarks report detailed results on LLM performance, including the rate at which the generated code passes all test cases.
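The following sketch shows one plausible way such a benchmark could score generated code, by executing it together with the test cases and counting the problems that pass. The function names, the data layout, and the use of a bare exec() are simplifying assumptions; a real harness would sandbox execution and typically report metrics such as pass@1.

```python
from typing import Dict, List

def passes_tests(generated_code: str, test_code: str) -> bool:
    """Run the model's code together with the benchmark's test cases.

    A bare exec() is used only for illustration; a production harness
    would isolate execution (separate process, timeouts, resource limits).
    """
    namespace: Dict[str, object] = {}
    try:
        exec(generated_code, namespace)  # defines the required functions
        exec(test_code, namespace)       # runs the self-calling test cases
        return True
    except Exception:
        return False

def pass_rate(problems: List[Dict[str, str]]) -> float:
    """Fraction of problems whose generated solution passes all tests."""
    solved = sum(passes_tests(p["completion"], p["tests"]) for p in problems)
    return solved / len(problems)
```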

Impact on LLM Development

The introduction of HumanEval Pro and MBPP Pro offers LLM developers new opportunities to evaluate and improve the capabilities of their models. The benchmarks help to identify the strengths and weaknesses of the models in terms of generating self-calling code and to make targeted optimizations. This contributes to further increasing the performance of LLMs in the field of code generation and paves the way for new applications in software development.

Future Research

Research on code generation with LLMs is dynamic and promising. Future work could focus on developing further benchmarks that cover even more complex programming tasks, such as generating code for specific application domains or integrating LLMs into existing development environments. Continuously refining these evaluation methods will make it easier to measure real progress in code generation and to identify where models still fall short.
