V-MAGE: A New Game-Based Benchmark for Evaluating Visual Abilities of Multimodal Large Language Models

Multimodal Large Language Models in Game Testing: V-MAGE Assesses Visually-Centric Abilities

The rapid development of Multimodal Large Language Models (MLLMs) has led to impressive advancements across various multimodal benchmarks. However, as evaluation shifts from static datasets to dynamic, real-world-like environments, current game-based benchmarks are reaching their limits: they lack visually-centric tasks and cannot assess the diverse reasoning skills required for real-world decision-making.

To address this gap, researchers have developed V-MAGE (Visual-centric Multiple Abilities Game Evaluation), a game-based evaluation framework specifically designed to assess the visually-centric abilities of MLLMs. V-MAGE comprises five different games with more than 30 handcrafted levels that test models on fundamental visual skills such as positioning, trajectory tracking, timing, and visual memory, as well as higher-level cognitive processes such as long-term planning and decision-making.
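To make the setup more concrete, the following is a minimal sketch of how such a game-based evaluation loop could look: the model repeatedly receives a rendered game state, chooses an action, and is scored when the level ends. All class and function names (GameEnv, dummy_model, evaluate) are illustrative assumptions for this sketch and do not reflect the actual V-MAGE implementation or API.

```python
# Minimal, hypothetical sketch of a game-based MLLM evaluation loop.
# Names and mechanics are illustrative assumptions, not the V-MAGE API.
from dataclasses import dataclass
from typing import List


@dataclass
class GameEnv:
    """Toy stand-in for one level: the agent must move a cursor onto a target."""
    target: int = 5
    position: int = 0
    steps_left: int = 10
    done: bool = False

    def render_frame(self) -> str:
        # A real framework would return an image; here we return a text "frame".
        return f"cursor at {self.position}, target at {self.target}"

    def step(self, action: str) -> None:
        if action == "right":
            self.position += 1
        elif action == "left":
            self.position -= 1
        self.steps_left -= 1
        self.done = self.position == self.target or self.steps_left <= 0

    def score(self) -> float:
        return 1.0 if self.position == self.target else 0.0


def dummy_model(frame: str, history: List[str]) -> str:
    """Placeholder for an MLLM call: reads the frame and picks an action."""
    cursor, target = (int(tok) for tok in frame.replace(",", " ").split() if tok.isdigit())
    return "right" if cursor < target else "left"


def evaluate(env: GameEnv) -> float:
    """Run one episode: frame -> model -> action, repeated until the level ends."""
    history: List[str] = []
    while not env.done:
        frame = env.render_frame()
        action = dummy_model(frame, history)
        history.append(action)
        env.step(action)
    return env.score()


if __name__ == "__main__":
    print("episode score:", evaluate(GameEnv()))
```

In an actual framework, the frame would be an image, the placeholder model a real MLLM queried with the frame and an instruction prompt, and the per-level scores would feed into a cross-model ranking.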

V-MAGE Reveals Weaknesses of Current MLLMs

Applying V-MAGE to leading MLLMs has revealed significant challenges in their visual perception and reasoning. Across all game environments, even the best-performing models, ranked via Elo rating comparisons, show a substantial performance gap relative to human players. These results highlight critical limitations, including various types of perceptual errors made by the models.
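Elo ratings are a standard way to turn pairwise outcomes into a relative ranking, which is why gaps to human players can be read directly off the scores. The sketch below shows the usual Elo update rule; the K-factor of 32 and the initial rating of 1500 are common defaults and are assumptions here, not parameters taken from V-MAGE.

```python
# Standard Elo update rule; K-factor and initial ratings are assumed defaults.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one comparison (score_a: 1 win, 0.5 draw, 0 loss)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Example: model X beats model Y in one head-to-head, then loses to a human.
model_x, model_y, human = 1500.0, 1500.0, 1500.0
model_x, model_y = elo_update(model_x, model_y, 1.0)  # model X wins
human, model_x = elo_update(human, model_x, 1.0)      # human wins
print(round(model_x), round(model_y), round(human))
```

Repeating such pairwise updates over many games yields a ranking in which a persistent rating gap between the top models and human players reflects a consistent performance difference rather than a single lost match.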

Potential for Improvements

Analyzing the test results offers valuable insights into potential approaches for improvement from an agent-centric perspective. These include refining agent strategies and addressing perceptual inaccuracies. The identified weaknesses suggest that future research should focus on enhancing the visual processing capabilities of MLLMs to improve their performance in complex, dynamic environments.

V-MAGE as an Important Step Towards Realistic Evaluation

V-MAGE represents a significant step towards more realistic evaluation of MLLMs. By combining various games and difficulty levels, the framework enables a comprehensive analysis of the visual skills and reasoning processes of these models. The results provide valuable information for the further development of MLLMs and their application in real-world scenarios that require a high degree of visual intelligence.

Outlook

The development of MLLMs is progressing rapidly. V-MAGE offers an important tool for objectively measuring progress in this field while simultaneously identifying areas that require further research. The continuous improvement of frameworks like V-MAGE is crucial to ensure the development of robust and reliable MLLMs for future applications. Particularly in the context of companies like Mindverse, which specialize in developing customized AI solutions, precise evaluation instruments are essential to ensure the performance and reliability of MLLMs in real-world applications such as chatbots, voicebots, and AI search engines.

Bibliography:
Zheng, X., Li, L., Yang, Z., Yu, P., Wang, A. J., Yan, R., Yao, Y., & Wang, L. (2025). V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models. arXiv preprint arXiv:2504.06148. https://arxiv.org/abs/2504.06148
https://arxiv.org/html/2504.06148v1
https://paperreading.club/page?id=298228
https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/06710.pdf
https://openreview.net/forum?id=jpypMKAsO6
https://www.researchgate.net/publication/389821565_How_Do_Multimodal_Large_Language_Models_Handle_Complex_Multimodal_Reasoning_Placing_Them_in_An_Extensible_Escape_Game
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
https://dl.acm.org/doi/10.1145/3641289
https://aclanthology.org/2024.findings-acl.64.pdf
https://coling2025.org/program/main_conference_papers/