How Language Influences Visual Bias in Vision Language Models

Visual Perception of AI Models: Shape or Texture?

Vision Language Models (VLMs) have revolutionized computer vision in recent years. They enable a variety of new applications, from zero-shot image classification and image captioning to visual question answering. In contrast to purely image-based models, VLMs make visual content accessible through language. The breadth of these applications raises the question of how closely the visual perception of VLMs aligns with that of humans. In particular, it is worth investigating whether VLMs modulate human-like visual biases through the multimodal fusion of image and text, or whether they simply inherit them from their underlying vision models.

A prominent visual bias is the so-called texture-shape bias, which describes whether local or global information dominates. Simply put: does the model rely more on the fine details of an object's surface (its texture) or on its overall shape? Humans strongly tend to recognize objects by their shape. The study investigates how this bias manifests in a range of common VLMs.
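The texture-shape bias is commonly measured on cue-conflict images, e.g., a cat-shaped object rendered with elephant skin: shape bias is the fraction of shape decisions among all cases where the model picks either cue. The sketch below illustrates this standard metric on toy data; the labels and predictions are invented for demonstration.

```python
def shape_bias(predictions):
    """Shape bias on cue-conflict images.

    predictions: list of (predicted, shape_label, texture_label) tuples.
    Predictions matching neither cue are ignored.
    """
    shape = sum(1 for pred, s, _ in predictions if pred == s)
    texture = sum(1 for pred, _, t in predictions if pred == t)
    if shape + texture == 0:
        return 0.0
    return shape / (shape + texture)

# Toy results: e.g., a cat shape with elephant texture classified as "cat"
preds = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("dog", "dog", "clock"),          # shape decision
    ("car", "dog", "clock"),          # matches neither cue -> ignored
]
print(f"shape bias: {shape_bias(preds):.2f}")  # 2 shape vs 1 texture -> 0.67
```

A shape bias of 1.0 would mean the model always follows the shape cue; 0.0 means it always follows the texture cue.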

Surprisingly, the research shows that VLMs are often more shape-focused than the vision encoders they are built on. This suggests that visual biases are modulated by text in multimodal models. If text does indeed influence visual biases, we may be able to steer these biases not only through visual input but also through language.

This hypothesis was confirmed in extensive experiments. The researchers succeeded in steering the shape bias between 49% and 72% solely through the formulation of the prompts, i.e., the text inputs to the model. However, the strong human shape bias of 96% remained out of reach for all tested VLMs.
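The steering idea can be sketched as follows: evaluate the same cue-conflict images under differently worded prompts and compare the resulting shape bias. The prompts here are illustrative rather than the exact wording from the paper, and `vlm_answer` is a hypothetical stand-in for any real VLM call.

```python
# Illustrative prompt variants; the paper's exact wordings may differ.
PROMPTS = {
    "neutral": "Which object is shown in this image?",
    "shape":   "Identify the object by its overall shape.",
    "texture": "Identify the object by its surface texture.",
}

def steered_shape_bias(vlm_answer, images, prompt):
    """Shape bias of a VLM under a given prompt.

    vlm_answer(image, prompt) -> predicted class label (hypothetical API).
    images: list of (image, shape_label, texture_label) tuples.
    """
    shape = texture = 0
    for image, shape_label, texture_label in images:
        pred = vlm_answer(image, prompt)
        if pred == shape_label:
            shape += 1
        elif pred == texture_label:
            texture += 1
    return shape / (shape + texture) if shape + texture else 0.0
```

Running this for each entry in `PROMPTS` and comparing the scores shows how far wording alone moves the bias, which is exactly the kind of measurement behind the reported 49% to 72% range.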

The Significance of the Research for the Development of AI

The results of this study are of great importance for the further development of AI models. They show that the integration of text into visual models not only opens up new application possibilities, but also influences the way these models process visual information. Understanding and controlling visual biases is crucial for developing AI systems that function reliably and robustly in real-world application scenarios. The ability to control biases through language opens up new avenues for developing AI partners that can adapt to the specific needs and requirements of their human users.

For companies like Mindverse, which specialize in the development of customized AI solutions, these findings are particularly relevant. The development of chatbots, voicebots, AI search engines, and knowledge systems benefits from a deeper understanding of the visual perception of AI models. By specifically controlling biases, these systems can be trained to better meet user expectations and deliver more precise results.

Research in this area is far from complete. Further studies are necessary to fully understand the complex relationships between text, image, and visual perception in AI models. However, the results so far suggest that controlling visual biases through language is a promising approach for developing the next generation of AI systems.

Bibliography: Gavrikov, P., Lukasik, J., Jung, S., Geirhos, R., Lamm, B., Mirza, M. J., Keuper, M., & Keuper, J. (2024). Are Vision Language Models Texture or Shape Biased and Can We Steer Them?. *arXiv preprint arXiv:2403.09193*.