VisVM: Scaling Inference-Time Search for Enhanced Visual Comprehension in Vision-Language Models

Vision-Language Models (VLMs) have made remarkable progress in recent years: they can understand, describe, and answer questions about images. A crucial lever for improving these models further is inference, the process by which a model generates a response from its input. One promising way to raise response quality is to scale the amount of computation spent at inference time, an approach that has already proven to be a central building block of self-learning systems for large language models.
In this context, researchers have introduced the Vision Value Model (VisVM), a novel method that guides the inference-time search of VLMs toward responses with better visual comprehension. VisVM not only evaluates the quality of the sentence currently being generated but also anticipates the quality of the sentences that could follow from it. This long-term perspective steers VLMs away from sentences that are prone to hallucination or lack sufficient detail, resulting in higher-quality responses.
How VisVM Works
VisVM is based on the idea of optimizing inference-time search through predictive evaluation. Instead of just evaluating the current sentence, VisVM also considers the potential impact of the sentence on subsequent generation. This long-term perspective allows the model to favor sentences that lead to a more comprehensive and detailed description of the image.
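One way to make this long-term perspective concrete is to view the value of a candidate sentence as its immediate visual quality plus the anticipated quality of whatever follows it. The formulation below is an illustrative, discounted (TD-style) sketch rather than the exact objective from the paper; the symbols $r_t$ (per-sentence visual-quality reward), $\gamma$ (discount factor), and $V$ (learned value function) are placeholder notation.

```latex
% Illustrative sketch: value of appending sentence s_t given image I and the
% response prefix s_{<t}; r_t, \gamma, and V are assumed placeholder symbols.
V(s_t \mid I, s_{<t}) \;\approx\; r_t \;+\; \gamma \, \mathbb{E}\!\left[ V(s_{t+1} \mid I, s_{\le t}) \right]
```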
By anticipating future sentence quality, VisVM can effectively reduce hallucinations, i.e., the generation of content that is not present in the image. Simultaneously, VisVM encourages the generation of sentences with rich visual details, leading to a better overall understanding of the image.
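To illustrate how such a value model can steer generation, the following Python sketch runs a simple sentence-level best-of-k search: at each step the VLM proposes several candidate next sentences, the value model scores each one, and the highest-valued candidate is appended. All names here (`vlm.propose_sentences`, `vlm.is_finished`, `value_model.score`, `num_candidates`) are hypothetical placeholders for illustration, not the actual VisVM API.

```python
def value_guided_caption(vlm, value_model, image, prompt,
                         num_candidates=8, max_sentences=6):
    """Greedy sentence-level search guided by a value model (illustrative sketch).

    Assumed interfaces (placeholders, not the released VisVM code):
      - vlm.propose_sentences(image, prefix, n): sample n candidate next sentences
      - vlm.is_finished(prefix): True once the response is complete
      - value_model.score(image, prefix, sentence): estimated long-term value of
        appending `sentence`, i.e. its current quality plus the anticipated
        quality of the sentences that would follow it.
    """
    prefix = prompt
    for _ in range(max_sentences):
        # Sample several possible continuations, one sentence at a time.
        candidates = vlm.propose_sentences(image, prefix, n=num_candidates)
        if not candidates:
            break
        # Keep the candidate with the highest predicted long-term value,
        # discouraging sentences likely to drift into hallucination.
        best = max(candidates, key=lambda s: value_model.score(image, prefix, s))
        prefix = prefix + " " + best
        if vlm.is_finished(prefix):
            break
    return prefix
```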
Experimental Results
Experimental results demonstrate that VisVM-guided search significantly improves the ability of VLMs to generate descriptive image captions with richer visual detail and fewer hallucinations. It clearly outperforms both conventional decoding methods and search guided by other visual evaluation signals.
Furthermore, the results show that self-training the VLM on captions generated with VisVM-guided search improves its performance across a wide range of multimodal benchmarks. This points to VisVM's potential for building self-learning VLMs.
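As a rough illustration of such a self-training round, one could regenerate captions for a pool of unlabeled images with value-guided search and then fine-tune the VLM on the resulting image-caption pairs. The sketch below reuses the hypothetical `value_guided_caption` helper from above and takes a generic `fine_tune` routine as a parameter; neither is part of the released VisVM code.

```python
def self_train(vlm, value_model, unlabeled_images, fine_tune,
               prompt="Describe the image in detail."):
    """One round of self-training on value-guided captions (illustrative sketch).

    `value_guided_caption` is the search helper sketched above; `fine_tune`
    stands in for any standard supervised fine-tuning routine that takes a
    model and a list of (image, caption) pairs. Both are assumed placeholders.
    """
    # 1) Use value-guided search to produce higher-quality captions.
    pairs = [(img, value_guided_caption(vlm, value_model, img, prompt))
             for img in unlabeled_images]
    # 2) Fine-tune the VLM on its own improved captions.
    return fine_tune(vlm, pairs)
```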
Outlook and Significance for Mindverse
The development of VisVM represents a significant step towards more powerful and robust VLMs. The ability to control inference-time search through long-term evaluation opens up new possibilities for improving the visual comprehension of AI systems. For Mindverse, as a provider of AI-powered content solutions, VisVM offers the potential to further enhance the quality of generated content and drive the development of innovative applications in the field of visual communication.