Categorization in Vision-Language Models: Do They Mirror Human Behavior?

The way humans categorize objects is a complex cognitive process that has been the subject of psychological research for decades. A key concept in this field is the "basic category," a term coined by Eleanor Rosch in 1976. This level of categorization is preferentially used by humans, offers a high density of information, and plays an important role in processing visual information and language production. Recent studies are now investigating whether and to what extent modern vision-language models (VLMs) exhibit similar categorization patterns.

The Basic Category and its Significance

The basic category represents a kind of middle ground in the hierarchy of categorization. It is more specific than superordinate categories (e.g., "animal" instead of "living being") and more general than subordinate categories (e.g., "dog" instead of "Golden Retriever"). Humans use the basic category most often in everyday life because it offers an optimal balance between information content and processing effort. For example, we are more likely to recognize an animal as a "dog" than as a "mammal" or a "Dachshund."
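The three levels can be pictured as a simple lookup from a label to its place in the hierarchy. The taxonomy below is a minimal illustrative sketch built from the article's own examples, not data from the cited studies:

```python
# Illustrative three-level taxonomy following Rosch's terminology.
# The entries are the article's examples, not a real dataset.
TAXONOMY = {
    "living being": "superordinate",
    "animal": "superordinate",
    "dog": "basic",
    "bird": "basic",
    "Golden Retriever": "subordinate",
    "Dachshund": "subordinate",
}

def category_level(label: str) -> str:
    """Return the Rosch level of a label, or 'unknown' if it is not listed."""
    return TAXONOMY.get(label, "unknown")

print(category_level("dog"))        # basic
print(category_level("Dachshund"))  # subordinate
```

In this framing, the human preference described above amounts to answering at the "basic" level far more often than at the other two.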

VLMs and the Basic Category

Current research suggests that VLMs, like humans, show a preference for the basic category. Studies with models such as Llama 3.2 Vision Instruct and Molmo 7B-D have shown that, when describing images, these models tend to use terms corresponding to the basic category. This indicates that VLMs implicitly pick up human categorization patterns through training on large datasets of human language and images.
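Such a preference can be quantified with a simple tally: map each free-form naming response to a category level and compute the share of basic-level answers. The taxonomy and the response list below are hypothetical illustrations, not the data or scoring code of the cited studies:

```python
from collections import Counter

# Hypothetical label -> Rosch-level mapping (illustrative only).
LEVELS = {
    "animal": "superordinate",
    "dog": "basic", "cat": "basic", "bird": "basic",
    "Dachshund": "subordinate", "blackbird": "subordinate",
}

def basic_level_rate(responses):
    """Fraction of naming responses that fall on the basic level."""
    counts = Counter(LEVELS.get(r, "unknown") for r in responses)
    return counts["basic"] / len(responses)

# Made-up single-word answers a VLM might give for six images:
answers = ["dog", "dog", "cat", "bird", "animal", "Dachshund"]
print(basic_level_rate(answers))  # 4 of 6 answers are basic-level
```

A rate well above what the superordinate and subordinate levels receive would count as a basic-level preference in this toy setup.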

Nuances of Categorization in VLMs

However, the similarity between human and machine categorization goes beyond a simple preference for the basic category. VLMs also exhibit finer nuances known from human cognition. For example, an "expert effect" has been observed: when a VLM is prompted with information specialized in a particular domain (e.g., ornithology), its categorization shifts toward subordinate categories (e.g., "blackbird" instead of "bird"). This mirrors the behavior of human experts, who make more fine-grained distinctions within their field of expertise.
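One way to probe such an effect is to compare a neutral prompt with an expert-persona prompt and measure how deep in a category chain the model's answer lands. The prompt wording, the persona, and the hierarchy below are assumptions for illustration; they are not taken from the cited studies:

```python
# Hypothetical prompt variants for probing the "expert effect".
NEUTRAL_PROMPT = "What is shown in this image? Answer with a single word."
EXPERT_PROMPT = (
    "You are an ornithologist. "
    "What is shown in this image? Answer with a single word."
)

def specificity(label: str, hierarchy: list) -> int:
    """Depth of a label in a superordinate-to-subordinate chain, or -1."""
    return hierarchy.index(label) if label in hierarchy else -1

chain = ["animal", "bird", "blackbird"]  # illustrative hierarchy

# An expert-primed model shifting its answer from "bird" to "blackbird"
# would register as an increase in specificity:
print(specificity("bird", chain) < specificity("blackbird", chain))  # True
```

Averaging this depth over many images, separately for the two prompts, would show whether expert priming systematically pushes answers toward the subordinate level.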

Another interesting aspect is the distinction between biological and non-biological objects. Studies have shown that humans are more likely to use the basic category when categorizing living things than when categorizing inanimate objects. The same pattern has been observed in VLMs, further underscoring the parallel to human cognition.

Outlook

Research on categorization in VLMs is still in its early stages, but the results so far are promising. They suggest that these models are not only capable of recognizing and describing images, but also of replicating more complex cognitive processes of humans. Further research is necessary to fully understand the mechanisms behind these phenomena and to exploit the potential of VLMs for applications in areas such as image search, human-computer interaction, and artificial intelligence in general.

Bibliography:
https://arxiv.org/abs/2210.07183
https://huggingface.co/blog/vlms
https://arxiv.org/html/2405.17247v1
https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/06472.pdf
https://www.researchgate.net/publication/365889707_Exploiting_Category_Names_for_Few-Shot_Classification_with_Vision-Language_Models
https://proceedings.neurips.cc/paper_files/paper/2024/file/aee5298251a418aad89618cf6b5e7ccc-Paper-Conference.pdf
https://openreview.net/forum?id=sQ0TzsZTUn
https://huggingface.co/blog/vision_language_pretraining
https://openaccess.thecvf.com/content/CVPR2024/papers/Sharma_A_Vision_Check-up_for_Language_Models_CVPR_2024_paper.pdf
https://aclanthology.org/2024.cmcl-1.2.pdf