Evaluating AI Agent Performance in Simulated Work Environments: TheAgentCompany Benchmark

Interacting with computers is ubiquitous in both private and professional contexts, and many workflows can now be handled entirely digitally. In parallel, advances in large language models (LLMs) have driven the rapid development of AI agents that can perceive their environment and act on it. But how effective are these agents, really, at supporting or even autonomously executing work-related tasks?
This question matters both for companies deciding how to integrate AI into their workflows and for policymakers trying to understand the impact of AI adoption on the labor market. TheAgentCompany was developed to measure how well LLM agents perform real-world professional tasks. The benchmark evaluates agents that interact with the world much like digital workers do: by browsing the web, writing code, running programs, and communicating with other "employees".
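To make this interaction model concrete, here is a minimal, hypothetical sketch of such an observe-act loop. The `Action` type, `Environment` interface, and `propose_action` call are illustrative assumptions, not TheAgentCompany's actual API:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Action:
    # Illustrative action kinds: "browse", "write_code", "run_program",
    # "message_coworker", or "finish"; not the benchmark's real action space.
    kind: str
    payload: str  # URL, code snippet, shell command, or chat message


class Environment(Protocol):
    def observe(self) -> str: ...               # current page, terminal output, or chat state
    def step(self, action: Action) -> str: ...  # apply the action, return the new observation


def run_agent(llm, env: Environment, max_steps: int = 50) -> str:
    """Minimal observe-act loop: the LLM proposes the next action from the latest observation."""
    observation = env.observe()
    for _ in range(max_steps):
        action = llm.propose_action(observation)  # hypothetical wrapper around an LLM call
        if action.kind == "finish":
            break
        observation = env.step(action)
    return observation
```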
TheAgentCompany: A Simulated Corporate Environment
TheAgentCompany builds a self-contained environment with internal websites and data that simulates a small software company. Within this environment, the authors define a variety of tasks that employees of such a company might perform. The evaluated agents are built on both closed, API-based models and open-weight language models. In the results, the strongest agent completed 24% of the tasks autonomously.
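The 24% figure refers to tasks completed end to end; benchmarks of this kind typically also grade partial progress via per-task checkpoints. A minimal sketch of such scoring, assuming a simple 0.5 weighting for partial credit (an illustrative choice, not the paper's exact formula), might look like this:

```python
def task_score(checkpoints_passed: int, total_checkpoints: int) -> float:
    """Checkpoint-style grading: full credit only when every checkpoint passes,
    otherwise partial credit proportional to progress. The 0.5 weight is an
    assumed illustration, not TheAgentCompany's exact scoring rule."""
    if total_checkpoints <= 0:
        return 0.0
    if checkpoints_passed >= total_checkpoints:
        return 1.0
    return 0.5 * checkpoints_passed / total_checkpoints


def full_completion_rate(results: list[tuple[int, int]]) -> float:
    """Share of tasks solved end to end, i.e. the kind of headline number
    reported as '24% of tasks completed autonomously'."""
    if not results:
        return 0.0
    done = sum(1 for passed, total in results if total and passed >= total)
    return done / len(results)
```

With, say, `results = [(3, 3), (1, 4), (0, 2)]`, the full-completion rate is 1/3 while the average checkpoint score is higher, which is exactly the gap between full and partial completion that such evaluations expose.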
A Nuanced Picture of Task Automation
These results paint a nuanced picture of task automation with LLM agents. While many simpler tasks can already be solved autonomously in a simulated work environment, more complex, longer-horizon tasks remain beyond the reach of current systems. The study underscores the need for further research to improve agents' capabilities in such scenarios.
The Importance of Benchmarks for AI Development
Meaningful benchmarks are essential for evaluating AI systems and tracking their progress. TheAgentCompany offers valuable insight into the current strengths and weaknesses of LLM agents, and its results can help steer agent development in a more targeted way and make real-world deployments more effective. Future research should focus on improving agents' performance on complex, long-horizon tasks to fully exploit the potential of this technology.
Outlook
The ongoing development of AI agents promises to transform how we work. Benchmarks like TheAgentCompany are crucial for measuring progress and identifying the challenges that remain. If AI agents become able to perform complex tasks autonomously, the nature of work will change substantially in the coming years. Research in this area is therefore important for shaping collaboration between humans and machines and for maximizing the benefits of AI for the economy and society.
Sources
Open Philanthropy. *[On hiatus] Request for proposals: benchmarking LLM agents on consequential real-world tasks*. https://www.openphilanthropy.org/rfp-llm-benchmarks/
Su, Zhe et al. *TheAgentCompany: Benchmarking LLM Agents on Consequential Real-World Tasks*. https://openreview.net/pdf/b1b1c2b7486d862f73817900b5f3336c45483c80.pdf
Chan, Lawrence. *Open Phil releases RFPs on LLM Benchmarks and Forecasting*. https://www.alignmentforum.org/posts/ccNggNeBgMZFy3FRr/open-phil-releases-rfps-on-llm-benchmarks-and-forecasting
Xu, Frank F. et al. *AgentBench: Evaluating LLMs as Agents*. https://arxiv.org/pdf/2406.12045
THUDM. *AgentBench*. https://github.com/THUDM/AgentBench
Brown, Sam et al. *Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents*. https://www.alignmentforum.org/posts/s9zd6f9eZ8qN2jrcu/auto-enhance-developing-a-meta-benchmark-to-measure-llm
Perez, Beren. *Benchmarking LLM agents on Kaggle competitions*. https://www.lesswrong.com/posts/cZHezHezooJ4ryiro/benchmarking-llm-agents-on-kaggle-competitions
Liu, Xiao et al. *AgentBench: Evaluating LLMs as Agents*. https://arxiv.org/abs/2308.03688
Jiang, Yuxuan et al. *AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents*. https://www.researchgate.net/publication/384887247_AgentHarm_A_Benchmark_for_Measuring_Harmfulness_of_LLM_Agents?_tp=eyJjb250ZXh0Ijp7InBhZ2UiOiJzY2llbnRpZmljQ29udHJpYnV0aW9ucyIsInByZXZpb3VzUGFnZSI6bnVsbH19
Li, Boxuan et al. *PyBench: Evaluating LLM Agent on various real-world coding tasks*. https://paperswithcode.com/paper/pybench-evaluating-llm-agent-on-various-real/review/