MultiAgentBench: A New Benchmark for Evaluating Multi-Agent LLM Capabilities

Cooperation and Competition: MultiAgentBench Evaluates the Capabilities of LLM Agents
Large language models (LLMs) have demonstrated remarkable capabilities as autonomous agents. Existing benchmarks, however, either focus on single-agent tasks or are confined to narrow domains, and therefore fail to capture the dynamics of multi-agent coordination and competition. A new benchmark called MultiAgentBench aims to close this gap by comprehensively evaluating LLM-based multi-agent systems across a variety of interactive scenarios.
MultiAgentBench: A New Benchmark for Multi-Agent Systems
MultiAgentBench goes beyond measuring task completion alone: it also scores the quality of cooperation and competition using novel, milestone-based performance indicators. These milestones allow a finer-grained analysis of agent behavior and reveal the strengths and weaknesses of different coordination strategies.
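To make the idea of milestone-based evaluation concrete, here is a minimal sketch, not the benchmark's actual implementation, of how intermediate milestone progress might be blended with a binary task-completion flag. The class, function names, milestone labels, and the weighting factor are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeResult:
    """Outcome of one multi-agent episode (illustrative, not MultiAgentBench's API)."""
    task_completed: bool
    milestones_total: int
    milestones_reached: int
    # Optional labels for the milestones that were reached, e.g. "plan agreed"
    reached_labels: list[str] = field(default_factory=list)

def milestone_rate(result: EpisodeResult) -> float:
    """Fraction of predefined milestones the agents reached during the episode."""
    if result.milestones_total == 0:
        return 0.0
    return result.milestones_reached / result.milestones_total

def combined_score(result: EpisodeResult, alpha: float = 0.5) -> float:
    """Blend final task success with coordination quality.

    alpha weights task completion; (1 - alpha) weights milestone progress,
    so two runs that both fail the end task can still be distinguished by
    how far their coordination got.
    """
    return alpha * float(result.task_completed) + (1 - alpha) * milestone_rate(result)

# Example: a run that missed the final goal but hit most coordination milestones
run = EpisodeResult(task_completed=False, milestones_total=5, milestones_reached=4,
                    reached_labels=["plan agreed", "roles assigned", "subtasks done", "results shared"])
print(f"milestone rate: {milestone_rate(run):.2f}, combined score: {combined_score(run):.2f}")
```

The point of such a blended score is that pure task-completion metrics treat every failed run as identical, whereas milestone tracking still differentiates how well the agents coordinated along the way.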
Coordination Strategies and Innovative Approaches
The benchmark explores various coordination mechanisms, including star, chain, tree, and graph topologies, and additionally evaluates strategies such as group discussion and cognitive planning to probe how well LLMs handle complex interaction. In the reported experiments, the graph topology performs best among the coordination protocols in research scenarios.
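These topologies can be thought of as constraints on which agents are allowed to exchange messages. The sketch below is a simplified illustration under assumed agent names, not the benchmark's actual protocol code: it encodes a star and a chain as communication graphs and checks whether a message is permitted along an edge.

```python
from collections import defaultdict

def star(center: str, leaves: list[str]) -> dict[str, set[str]]:
    """Star topology: every leaf communicates only with the central coordinator."""
    edges: dict[str, set[str]] = defaultdict(set)
    for leaf in leaves:
        edges[center].add(leaf)
        edges[leaf].add(center)
    return dict(edges)

def chain(agents: list[str]) -> dict[str, set[str]]:
    """Chain topology: each agent communicates only with its immediate neighbors."""
    edges: dict[str, set[str]] = defaultdict(set)
    for a, b in zip(agents, agents[1:]):
        edges[a].add(b)
        edges[b].add(a)
    return dict(edges)

def can_send(topology: dict[str, set[str]], sender: str, receiver: str) -> bool:
    """A message is allowed only along an edge of the coordination graph."""
    return receiver in topology.get(sender, set())

# Hypothetical agent roles for illustration; tree and general graph topologies
# would be built the same way, just with different edge sets.
team = ["planner", "coder", "reviewer", "tester"]
star_topo = star("planner", team[1:])
chain_topo = chain(team)

print(can_send(star_topo, "coder", "reviewer"))   # False: leaves must route via the planner
print(can_send(chain_topo, "coder", "reviewer"))  # True: adjacent in the chain
```

Representing topologies as explicit edge sets like this makes it easy to swap coordination structures without changing the agents themselves, which is presumably why such comparisons across star, chain, tree, and graph layouts are feasible in the first place.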
Results and Insights
Initial results show that gpt-4o-mini achieves the highest average task score among the evaluated models, and that cognitive planning improves milestone success rates by 3%. These findings highlight the potential of advanced planning methods in multi-agent systems.
Open Access for the Research Community
The code and datasets for MultiAgentBench are publicly available, allowing researchers to reproduce the results, run their own experiments, and contribute to the benchmark's further development. This openness promotes transparency and exchange within the research community and accelerates the development of robust, capable multi-agent systems.
Future Perspectives
MultiAgentBench represents an important step in the evaluation and development of LLM-based multi-agent systems. Future research could focus on expanding the benchmark with additional scenarios and developing new evaluation metrics. The insights gained can contribute to a better understanding of the capabilities of LLMs in complex, collaborative environments and unlock their potential for real-world applications.
Bibliography:
- https://arxiv.org/abs/2408.15971
- https://arxiv.org/abs/2308.03688
- https://www.researchgate.net/publication/383494007_BattleAgentBench_A_Benchmark_for_Evaluating_Cooperation_and_Competition_Capabilities_of_Language_Models_in_Multi-Agent_Systems
- https://openreview.net/pdf/3dde663c1d03785c5b1c45a070a2ccb8c9e0d8e9.pdf
- https://github.com/THUDM/AgentBench
- https://nanonets.com/webflow-bundles/feb23update/RAG_for_PDFs/build_v6/pdfs/agentbench-evaluating-llms-as-agents.pdf
- https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_Teaching/Large_Language_Models_and_Agents/FSS2025/IE686_LA_04_LLMAgentsAndToolUse.pdf
- https://neurips.cc/virtual/2024/poster/97853
- https://www.researchgate.net/publication/387975271_Multi-Agent_Collaboration_Mechanisms_A_Survey_of_LLMs
- https://openreview.net/forum?id=zAdUB0aCTQ