CodeMonkeys Improves Software Development with Scaled Test-Time Compute

Improving the capabilities of large language models (LLMs) by scaling test-time compute is a promising approach. However, this scaling can be implemented in many different ways, and how to combine the different options effectively remains an open research question. A recently published paper, "CodeMonkeys: Scaling Test-Time Compute for Software Engineering", investigates this question in the context of resolving real-world GitHub issues from SWE-bench, a benchmark for evaluating automated software engineering systems.

The system presented in the paper, CodeMonkeys, lets models iteratively edit a codebase by generating and executing a testing script alongside each proposed code edit. For each problem, several such multi-step editing runs are launched to produce a collection of candidate edits. This design allows test-time compute to be scaled both serially, by increasing the number of iterations per editing run, and in parallel, by increasing the number of runs per problem.
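To make the two scaling axes concrete, here is a minimal sketch of how such a loop could be structured. All names below (`propose`, `execute`, `Candidate`, the iteration defaults) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Candidate:
    edit: str      # proposed patch to the codebase
    test: str      # model-generated test script
    passed: bool   # whether the edit passed its own test

def multi_step_run(
    propose: Callable[[str, Optional[str]], Tuple[str, str]],
    execute: Callable[[str, str], bool],
    issue: str,
    max_iters: int = 4,
) -> Candidate:
    """One editing run: iterate, feeding execution feedback back to the model."""
    feedback: Optional[str] = None
    edit = test = ""
    for _ in range(max_iters):                  # serial scaling: more iterations per run
        edit, test = propose(issue, feedback)   # model drafts an edit plus a test
        if execute(edit, test):                 # run the generated test against the edit
            return Candidate(edit, test, True)
        feedback = "generated test failed; revise the edit and/or the test"
    return Candidate(edit, test, False)

def solve_issue(propose, execute, issue: str, num_runs: int = 8) -> List[Candidate]:
    """Parallel scaling: independent runs produce a pool of candidate edits."""
    return [multi_step_run(propose, execute, issue) for _ in range(num_runs)]
```

In a real system the feedback string would carry the actual test output, and the runs could execute concurrently; the sketch only illustrates how the serial and parallel knobs fit together.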

Parallel scaling also lets up-front costs be amortized across many downstream samples. This makes it affordable to identify the relevant codebase context with a deliberately simple method: having an LLM read every file in the repository. To choose among the generated candidate edits, CodeMonkeys combines a voting procedure based on the model-generated tests with a final multi-step editing run dedicated to selection.
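One plausible reading of this selection step, sketched under the same caveats as above: each candidate edit is scored by how many of the pool's model-generated tests it passes, and a dedicated selection run (stubbed here as a callback) breaks ties among the top-scoring edits. The harness `run_test(edit, test) -> bool` is an assumed interface, and `Candidate` reuses the type from the previous sketch.

```python
from typing import Callable, List

def select_edit(
    candidates: List[Candidate],
    run_test: Callable[[str, str], bool],
    selection_run: Callable[[List[Candidate]], Candidate],
) -> Candidate:
    """Vote with model-generated tests; fall back to a selection run on ties."""
    tests = [c.test for c in candidates]
    # Score each edit by the number of generated tests it passes.
    scores = [sum(run_test(c.edit, t) for t in tests) for c in candidates]
    best = max(scores)
    finalists = [c for c, s in zip(candidates, scores) if s == best]
    if len(finalists) == 1:
        return finalists[0]
    # Ties go to a final multi-step run that examines the finalists directly.
    return selection_run(finalists)
```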

The results show that CodeMonkeys resolves 57.4% of the problems in the SWE-bench Verified dataset on a budget of approximately $2,300. Notably, the selection method can also be used to combine candidates from different sources: selecting from an ensemble of edits drawn from existing top SWE-bench Verified submissions achieves a success rate of 66.2%, surpassing the best individual submission.

These findings highlight the potential of scaling test-time compute for improving LLMs in software development. The combination of parallel sampling, automated test generation, and careful selection makes it possible to tackle complex programming problems effectively. The researchers' release of their code and data opens up further opportunities to build on this promising approach.

Scaling test-time compute opens new avenues for automating software development tasks. CodeMonkeys' ability to independently generate and execute tests represents an important step towards autonomous systems that can handle complex programming work. Further research in this area promises exciting developments for the future of software development.

Bibliography:
- https://arxiv.org/abs/2501.14723
- https://arxiv.org/pdf/2501.14723
- https://deeplearn.org/arxiv/570059/codemonkeys:-scaling-test-time-compute-for-software-engineering
- https://arxiv-sanity-lite.com/?rank=pid&pid=2501.14723
- https://www.linkedin.com/posts/samanthkoduru_the-idea-of-scaling-test-time-compute-offers-activity-7276327394624385025-Z2BP
- https://github.com/FudanSELab/Agent4SE-Paper-List
- https://www.researchgate.net/publication/382739350_Large_Language_Monkeys_Scaling_Inference_Compute_with_Repeated_Sampling
- https://openreview.net/forum?id=4FWAwZtd2n
- https://www.cognition.ai/blog/swe-bench-technical-report
- https://www.youtube.com/watch?v=QWoslkjR9W4