LEGO-Puzzles: A New Benchmark for Spatial Reasoning in Multimodal Large Language Models

Spatial reasoning, particularly over multiple steps, is a fundamental skill for many complex tasks, from robotics and navigation to automated assembly. How well do current multimodal large language models (MLLMs) master this challenge? A new benchmark called LEGO-Puzzles aims to answer this very question.
LEGO-Puzzles consists of 1,100 carefully selected questions and answers about LEGO images. The tasks are divided into eleven types, ranging from simple spatial understanding to complex multi-step inference. For example, MLLMs must count the LEGO bricks of a specific color, describe the relative positions of objects, or follow the steps of LEGO building instructions.
The results of tests with various state-of-the-art MLLMs reveal significant weaknesses in spatial reasoning. Even the most powerful models only achieve an accuracy of about 50%, while human participants achieve over 90%. This highlights a considerable gap between human and machine performance in this area.
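Benchmarks of this kind are typically scored by exact-match accuracy over the model's answers. The following is a minimal sketch of such scoring; the data format and answer style (single-letter multiple choice) are assumptions for illustration, not the paper's actual evaluation code.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answers.

    Comparison is case-insensitive and ignores surrounding whitespace,
    as is common for multiple-choice answers like 'A'-'D'.
    """
    if not references:
        raise ValueError("reference list must not be empty")
    correct = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# A model answering 2 of 4 questions correctly scores 0.5 -- in the same
# range as the ~50% reported for the strongest MLLMs on LEGO-Puzzles,
# while human participants exceed 0.9.
model_answers = ["A", "b", "C", "D"]
gold_answers = ["A", "B", "A", "A"]
print(exact_match_accuracy(model_answers, gold_answers))  # 0.5
```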
In addition to the question-and-answer tasks, the MLLMs were also tested on their ability to generate LEGO images based on building instructions. Here, only Gemini-2.0-Flash and GPT-4o showed limited abilities to correctly implement the instructions. Other MLLMs either reproduced the input image or generated completely irrelevant outputs.
The results of LEGO-Puzzles underscore the need for further research and development in the field of multimodal spatial reasoning. The ability to understand and reason about spatial relationships is crucial for the development of AI systems that can handle complex tasks in the real world. LEGO-Puzzles provides a valuable basis for evaluating and comparing MLLMs in this important area and can help drive the development of more powerful models.
These results are particularly relevant for companies like Mindverse that specialize in developing AI solutions. Building chatbots, voicebots, AI search engines, and knowledge systems requires not only a deep understanding of language and context but also the ability to process and interpret spatial information. For Mindverse, which develops customized AI solutions, LEGO-Puzzles offers valuable insight into the strengths and weaknesses of current MLLMs; its findings can help focus the development of more capable AI systems and further expand the application areas of AI technologies.
Bibliography:
- Tang, K., Gao, J., Zeng, Y., Duan, H., Sun, Y., Xing, Z., Liu, W., Lyu, K., & Chen, K. (2025). LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? arXiv preprint arXiv:2503.19990.
- https://arxiv.org/pdf/2401.03991
- https://openreview.net/forum?id=GT4gMdvVFp
- https://medium.com/data-science/language-models-and-spatial-reasoning-whats-good-what-is-still-terrible-and-what-is-improving-175d2099eb4c
- https://www.researchgate.net/publication/382459771_Step-by-Step_Reasoning_to_Solve_Grid_Puzzles_Where_do_LLMs_Falter
- https://nips.cc/virtual/2024/papers.html
- https://arxiv.org/abs/2407.14790
- https://aclanthology.org/volumes/2024.findings-acl/
- https://cvpr.thecvf.com/Conferences/2024/Videos
- https://aclanthology.org/2024.emnlp-main.1111/
- https://dl.acm.org/doi/10.5555/773294