EmbodiedEval: A New Benchmark for Evaluating Multimodal Language Models in Embodied Tasks

Multimodal Language Models Put to the Test: EmbodiedEval Sets New Standards for Embodied Agents
Multimodal Large Language Models (MLLMs) have made remarkable progress in recent years, opening up promising possibilities for their use as embodied agents, i.e., AI agents that perceive and act in simulated or real-world environments. Previous benchmarks for evaluating MLLMs have focused primarily on static images or videos, limiting assessment to non-interactive scenarios. Existing benchmarks for embodied AI, on the other hand, are often task-specific and not diverse enough to comprehensively evaluate the capabilities of MLLMs in interactive environments.
To address this gap, EmbodiedEval was developed: a comprehensive, interactive benchmark for MLLMs on embodied tasks. EmbodiedEval comprises 328 diverse tasks across 125 varied 3D scenes, each carefully selected and annotated. The benchmark covers a broad spectrum of existing embodied AI tasks with significantly increased diversity, all within a unified simulation and evaluation framework tailored specifically for MLLMs.
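To make the idea of a unified, interactive evaluation loop concrete, the following Python sketch shows how an agent could be stepped through an episode in a simulator. It is a minimal illustration only, assuming a toy simulator and an option-based action format; the class names, observation fields, and method signatures are invented for exposition and are not EmbodiedEval's actual API.

```python
# Minimal sketch of an interactive evaluation loop, under assumed interfaces.
# Class names, observation fields, and the option-based action format are
# illustrative assumptions; they are NOT EmbodiedEval's actual API.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    image_path: str            # egocentric rendering of the 3D scene
    instruction: str           # natural-language task description
    action_options: List[str]  # discrete choices offered at this step


class ToySimulator:
    """Stand-in for a 3D simulator that serves observations and scores episodes."""

    OPTIONS = ["move forward", "turn left", "turn right", "stop"]

    def __init__(self, instruction: str, max_steps: int = 24) -> None:
        self.instruction = instruction
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> Observation:
        self.steps = 0
        return Observation("frame_000.png", self.instruction, self.OPTIONS)

    def step(self, action: str) -> Tuple[Observation, bool, bool]:
        self.steps += 1
        done = action == "stop" or self.steps >= self.max_steps
        success = done and action == "stop"  # placeholder success criterion
        obs = Observation(f"frame_{self.steps:03d}.png", self.instruction, self.OPTIONS)
        return obs, done, success


def choose_action(obs: Observation) -> str:
    """Placeholder for an MLLM call: a real agent would send the image,
    instruction, and action options to the model and parse its chosen option."""
    return obs.action_options[-1]  # trivially chooses "stop"


def run_episode(sim: ToySimulator) -> bool:
    obs = sim.reset()
    while True:
        obs, done, success = sim.step(choose_action(obs))
        if done:
            return success


if __name__ == "__main__":
    print("episode success:", run_episode(ToySimulator("Walk to the red sofa and stop.")))
```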
The Five Categories of EmbodiedEval
The tasks in EmbodiedEval are divided into five categories, each probing a different agent capability (a minimal representation sketch follows the list):
Navigation: Agents must navigate complex 3D environments and reach specific goals.
Object Interaction: Agents must interact with objects in the environment, e.g., picking them up, putting them down, or otherwise manipulating them.
Social Interaction: Agents must interact with other agents or virtual characters in the environment.
Attribute Question Answering: Agents must answer questions about the attributes of objects in the environment.
Spatial Question Answering: Agents must answer questions about spatial relationships between objects in the environment.
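As referenced above, the following sketch illustrates one way tasks from these five categories might be represented and aggregated into per-category scores. All type and field names here are hypothetical assumptions, not the benchmark's actual data schema.

```python
# Illustrative sketch of per-category task records and aggregation.
# Field names and the TaskCategory identifiers are assumptions for
# exposition, not the benchmark's actual data schema.

from dataclasses import dataclass
from enum import Enum
from typing import List


class TaskCategory(Enum):
    NAVIGATION = "navigation"
    OBJECT_INTERACTION = "object interaction"
    SOCIAL_INTERACTION = "social interaction"
    ATTRIBUTE_QA = "attribute question answering"
    SPATIAL_QA = "spatial question answering"


@dataclass
class TaskRecord:
    task_id: str
    category: TaskCategory
    scene_id: str
    instruction: str
    succeeded: bool = False


def success_rate(records: List[TaskRecord], category: TaskCategory) -> float:
    """Success rate within one category, the kind of aggregate a benchmark reports."""
    relevant = [r for r in records if r.category == category]
    return sum(r.succeeded for r in relevant) / len(relevant) if relevant else 0.0


if __name__ == "__main__":
    demo = [
        TaskRecord("nav-001", TaskCategory.NAVIGATION, "scene-07",
                   "Go to the kitchen and stand next to the fridge.", True),
        TaskRecord("aqa-014", TaskCategory.ATTRIBUTE_QA, "scene-22",
                   "What color is the mug on the table?", False),
    ]
    print(success_rate(demo, TaskCategory.NAVIGATION))  # -> 1.0
```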
Evaluation of Current MLLMs and Outlook
Initial evaluations of state-of-the-art MLLMs on EmbodiedEval show that they still fall significantly short of human-level performance on embodied tasks. The analysis highlights the limitations of current MLLMs' embodied capabilities and provides valuable insights for their future development, underscoring the need for further research to improve MLLM performance in interactive environments.
EmbodiedEval is available as an open-source project, including all evaluation data and the simulation framework. This allows researchers and developers to evaluate the capabilities of their own MLLMs and to contribute to the benchmark's further development. Releasing the data and framework promotes transparency and reproducibility and helps advance the research field.
The development of EmbodiedEval represents an important step towards a more comprehensive evaluation of MLLMs. The benchmark provides a valuable resource for the development and improvement of embodied agents and contributes to unlocking the potential of MLLMs for real-world applications.
Bibliography:
https://arxiv.org/abs/2501.11858
https://arxiv.org/html/2501.11858v1
https://github.com/thunlp/EmbodiedEval
https://paperreading.club/page?id=279233
https://openaccess.thecvf.com/content/CVPR2024/papers/Yang_Embodied_Multi-Modal_Agent_trained_by_an_LLM_from_a_Parallel_CVPR_2024_paper.pdf
https://research.google/blog/palm-e-an-embodied-multimodal-language-model/
https://neurips.cc/virtual/2024/poster/97552
https://arxiv-sanity-lite.com/?rank=pid&pid=2410.03450
https://openreview.net/forum?id=0Gl5WxY6es
https://aclanthology.org/2024.acl-long.37.pdf