A New Repository-Level Benchmark for Code Generation Assessment

Developing new features within existing codebases is a central task in software development. Large Language Models (LLMs) offer the potential to automate and accelerate this process. However, evaluating how well these models handle such repository-wide changes requires specialized benchmarks: existing ones mostly focus on generating isolated code snippets and therefore do not provide an adequate basis for evaluating changes at the repository level.

To address this gap, FEA-Bench was developed: a new benchmark specifically designed to assess the ability of LLMs to perform incremental development within code repositories. FEA-Bench is built from real-world pull requests drawn from 83 GitHub repositories. Rule-based and intent-based filtering is used to create task instances that focus on the development of new features. Each task instance contains the corresponding code changes and is associated with relevant unit test files, which allows generated solutions to be verified automatically and ensures that the implemented feature meets the intended requirements.
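
The verification step can be pictured with a minimal sketch like the one below. This is not the official FEA-Bench harness; the `TaskInstance` fields, the use of `git apply`, and `pytest` as the test runner are assumptions made purely for illustration.

```python
"""Minimal sketch (not the official FEA-Bench harness) of verifying a
repository-level task: apply the model's patch at the task's base commit,
then run the unit test files associated with the task."""

import subprocess
from dataclasses import dataclass, field


@dataclass
class TaskInstance:                       # hypothetical structure, not the benchmark's schema
    repo_path: str                        # local clone of the GitHub repository
    base_commit: str                      # commit the pull request branched from
    model_patch: str                      # unified diff produced by the LLM
    test_files: list[str] = field(default_factory=list)  # associated unit test files


def verify(task: TaskInstance) -> bool:
    """Return True if the model's patch applies cleanly and the task's tests pass."""
    # Reset the repository to the task's base commit.
    subprocess.run(["git", "checkout", "-f", task.base_commit],
                   cwd=task.repo_path, check=True)

    # Apply the generated change set from stdin.
    apply = subprocess.run(["git", "apply", "-"], input=task.model_patch,
                           text=True, cwd=task.repo_path)
    if apply.returncode != 0:
        return False  # patch does not even apply

    # Run only the unit test files associated with this feature.
    tests = subprocess.run(["python", "-m", "pytest", *task.test_files],
                           cwd=task.repo_path)
    return tests.returncode == 0
```

In this reading, a task counts as solved only if the feature-specific tests pass on the patched repository, which is what makes the evaluation automatable.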

Implementing a new feature requires LLMs not only to generate new code components but also to modify existing parts of the repository. FEA-Bench therefore provides a more comprehensive assessment of LLM capabilities in automated software development than previous benchmarks: its tasks require an understanding of the broader code context and the ability to make coordinated changes at multiple locations within the repository.
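
To make the "changes at multiple locations" aspect concrete, the sketch below splits a pull-request diff into newly added files and edits to existing files. The parsing is deliberately simplified and is an illustration only, not part of FEA-Bench itself.

```python
"""Illustrative sketch: classify the files touched by a unified diff into
new files versus edits to existing files, to show that a feature-level
change typically involves both kinds of modification."""

def classify_changed_files(unified_diff: str) -> dict[str, list[str]]:
    added, modified = [], []
    old_path = None
    for line in unified_diff.splitlines():
        if line.startswith("--- "):
            old_path = line[4:].strip()      # e.g. "a/pkg/core.py" or "/dev/null"
        elif line.startswith("+++ "):
            new_path = line[4:].strip()
            if old_path == "/dev/null":
                added.append(new_path)       # file did not exist before the change
            else:
                modified.append(new_path)    # an existing file is being edited
    return {"new_files": added, "edited_files": modified}
```

For a typical new-feature pull request one would expect both lists to be non-empty, for example a new module alongside edits to an existing package and to its tests.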

Initial Results and Challenges

Initial experimental results on FEA-Bench show that current LLMs still fall well short of expectations in this scenario, underscoring the substantial challenges of incremental code development at the repository level. The tasks in FEA-Bench demand a deeper understanding of the codebase and of the relationships between its components, and future research must focus on improving the capabilities of LLMs in these areas.

FEA-Bench offers a valuable foundation for the further development of code generation models. By providing realistic and complex task instances, the benchmark enables a precise evaluation of LLM performance in a practically relevant setting. Evaluation results on FEA-Bench can help identify the strengths and weaknesses of current models and steer research towards more robust and effective approaches to automated software development.

The release of FEA-Bench can be expected to shape the development and improvement of LLMs for code generation. A standardized benchmark lets researchers compare different models and measure progress in automated software development objectively, which in turn should lead to more efficient and reliable tools for developers and higher productivity in software development.
