Multimodal LLMs Struggle with Small Visual Details: Training-Free Solutions Offer Improvement

Multimodal Language Models: An Eye for Detail

Multimodal Large Language Models (MLLMs) have made rapid progress in visual recognition in recent years. Given their potential integration into many critical applications, it is important to understand the limits of their visual perception. A recent research paper, "MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs" (arXiv:2502.17422), investigates whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images.

The study shows that the accuracy of MLLMs depends strongly on the size of the visual subject of the question: performance drops markedly when the subject occupies only a small part of the image. Through targeted interventions on the images, the researchers confirmed that this effect is causal. Surprisingly, they also found that even when an MLLM answers incorrectly, its attention consistently lands on the relevant image regions: the models "know" where to look.
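
To make the "knows where to look" finding concrete, one common way to quantify it is to measure how much of the model's attention mass falls inside the ground-truth bounding box of the question's subject. The minimal sketch below assumes a 2D attention map over image patches has already been extracted from the model; the function name attention_inside_box and all parameter names are illustrative, not taken from the paper's code.

```python
import torch

def attention_inside_box(attention_map, box, image_size):
    """Fraction of the model's total attention mass that falls inside
    a ground-truth bounding box.

    attention_map: 2D tensor (grid_h x grid_w) of attention weights over
        image patches, e.g. averaged across heads and layers (hypothetical;
        the exact extraction depends on the specific MLLM).
    box: (left, top, right, bottom) in pixel coordinates.
    image_size: (width, height) of the original image.
    """
    width, height = image_size
    grid_h, grid_w = attention_map.shape
    # Map the pixel box onto the patch grid (at least one cell wide/tall).
    l = int(box[0] / width * grid_w)
    t = int(box[1] / height * grid_h)
    r = max(int(box[2] / width * grid_w), l + 1)
    b = max(int(box[3] / height * grid_h), t + 1)
    inside = attention_map[t:b, l:r].sum()
    return (inside / attention_map.sum()).item()
```

A ratio well above the box's area fraction of the image indicates that attention concentrates on the subject, even on questions the model answers wrongly.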

Training-Free Optimization of Visual Perception

Based on these findings, the researchers propose training-free visual intervention methods that leverage the MLLM's own internal knowledge. Attention and gradient maps computed from the model itself are used to locate the image region relevant to the question and crop the image to it, effectively letting the model zoom in on small visual details. The proposed methods were evaluated on two widely used MLLMs across seven visual question answering benchmarks, and the results show significant accuracy improvements without any additional training.
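
As an illustration of this idea, the sketch below crops an image around the peak of a model-derived attention map, so the crop can be fed back to the MLLM as a zoomed-in view. How the attention map is extracted is model-specific, and attention_guided_crop and its crop_ratio default are hypothetical names and values, not the paper's exact method.

```python
import torch
from PIL import Image

def attention_guided_crop(image: Image.Image,
                          attention_map: torch.Tensor,
                          crop_ratio: float = 0.5) -> Image.Image:
    """Zoom into the region where the model's attention peaks.

    attention_map: 2D tensor (grid_h x grid_w) over image patches.
    crop_ratio: side length of the crop relative to the full image
        (an illustrative default, not a value from the paper).
    """
    width, height = image.size
    grid_h, grid_w = attention_map.shape
    # Locate the attention peak on the patch grid.
    flat_idx = int(torch.argmax(attention_map))
    gy, gx = divmod(flat_idx, grid_w)
    # Convert the peak to pixel coordinates (center of the patch).
    cx = (gx + 0.5) / grid_w * width
    cy = (gy + 0.5) / grid_h * height
    # Take a crop of crop_ratio * image size centered on the peak,
    # clamped so it stays inside the image bounds.
    cw, ch = width * crop_ratio, height * crop_ratio
    left = min(max(cx - cw / 2, 0), width - cw)
    top = min(max(cy - ch / 2, 0), height - ch)
    return image.crop((int(left), int(top),
                       int(left + cw), int(top + ch)))
```

In a full pipeline, the crop would typically be resized to the model's input resolution and presented together with (or instead of) the original image before re-asking the question; gradient-based variants replace the attention map with a saliency map derived from the model's output gradients.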

The research highlights the risk of relying on MLLMs for recognition tasks that hinge on small visual details. At the same time, it shows that visual interventions which exploit the model's internal state are a promising way to mitigate this risk. The ability to read out where the model is looking and deliberately focus it on details opens up new possibilities for applying MLLMs in areas such as medical image analysis or quality control.

MLLMs at Mindverse

For companies like Mindverse, which specialize in AI-powered content solutions, these research results are highly relevant. Mindverse offers an all-in-one platform for AI texts, images, research, and more, and focuses on developing customized solutions such as chatbots, voicebots, AI search engines, and knowledge systems. Understanding the strengths and weaknesses of MLLMs is essential for building the best possible AI solutions for customers while working within the limits of the technology.

The findings of this study could help further optimize the MLLM-based applications Mindverse develops and improve their accuracy in detail-rich scenarios. This paves the way for innovative applications and strengthens Mindverse's position as a leading provider of AI solutions.

Bibliography:
- https://arxiv.org/abs/2502.17422
- https://openreview.net/forum?id=DgaY5mDdmT
- https://arxiv.org/html/2502.17422
- https://deeplearn.org/arxiv/579319/mllms-know-where-to-look:-training-free-perception-of-small-visual-details-with-multimodal-llms
- https://openreview.net/pdf/c768744b5022335d4d26727e8a9871b8d8293ea1.pdf
- https://www.linkedin.com/posts/prateekchhikara_iclr2025-mllms-ai-activity-7300373203531481088-Uyu-
- https://huggingface.co/papers
- https://synthical.com/article/MLLMs-Know-Where-to-Look%3A-Training-free-Perception-of-Small-Visual-Details-with-Multimodal-LLMs-f3273549-6276-486d-a448-9ac9a1bb3fc5?
- https://www.researchgate.net/publication/386374543_Enhancing_Perception_Capabilities_of_Multimodal_LLMs_with_Training-free_Fusion
- https://proceedings.neurips.cc/paper_files/paper/2024/file/4fd96b997454b5b02698595df70fccaf-Paper-Conference.pdf