GTR Enhances Visual Language Model Reasoning

Overcoming Thought Collapse in AI Training: GTR Guides Reasoning in Visual Language Models

Artificial intelligence (AI) is developing rapidly, particularly in the field of visual language models (VLMs), which can interpret images and generate text or actions based on them. A promising approach to training these models is reinforcement learning with verifiable outcome rewards (RLVR), which has already proven successful at improving logical reasoning ("chain-of-thought", CoT) in large language models (LLMs). However, applying RLVR to VLMs, especially for goal-directed action in visual environments, remains largely unexplored.

A recent study investigates the effectiveness of RLVR for training VLMs to act in goal-directed ways in visual environments. The researchers focused on the card game "24 Game" and tasks from the simulated household environment ALFWorld. They found that training based solely on the outcomes of actions does not lead VLMs to develop complex reasoning. Instead, they observed a phenomenon they call "thought collapse": a rapid loss of diversity in the agent's reasoning. The chains of thought become state-independent and incomplete, leading to invalid actions that in turn produce negative rewards.
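To make "verifiable outcome reward" concrete: in the 24 Game, a final answer can be checked mechanically, with no human judgment in the loop. The sketch below is illustrative only (the function name and the exact 0/1 reward shaping are assumptions, not taken from the paper): it scores an arithmetic expression with 1.0 if it uses exactly the dealt cards and evaluates to 24, and 0.0 otherwise.

```python
import re


def outcome_reward(expression: str, cards: list[int]) -> float:
    """Illustrative verifiable outcome reward for the 24 Game:
    1.0 iff the expression uses exactly the dealt cards and
    evaluates to 24. (Hypothetical sketch, not the paper's code.)"""
    # Each dealt card must appear exactly once in the expression.
    used = sorted(int(n) for n in re.findall(r"\d+", expression))
    if used != sorted(cards):
        return 0.0
    try:
        # Evaluate pure arithmetic; no builtins are exposed to eval.
        value = eval(expression, {"__builtins__": None}, {})
    except (SyntaxError, ZeroDivisionError, TypeError):
        return 0.0
    return 1.0 if abs(value - 24) < 1e-6 else 0.0
```

Note that this reward says nothing about the reasoning that produced the expression, which is precisely why, as the study observes, outcome-only training leaves the chain of thought unconstrained.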

To counteract this thought collapse, the study emphasizes the need for process guidance during training. The researchers propose an automated correction mechanism that evaluates and refines the agent's reasoning at each reinforcement-learning step. This approach, called Guided Thought Reinforcement (GTR), trains reasoning and action execution jointly without detailed, step-by-step human labeling, which makes training considerably simpler and more scalable.
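In a minimal, highly simplified form, per-step process guidance can be pictured as follows. Everything here is a hypothetical stand-in — in GTR the corrector is an auxiliary model, not a string check — but the sketch shows the shape of the idea: the agent's thought is refined at every step, and deviation from the corrected thought contributes to the training signal alongside the outcome reward.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One agent step: what it saw, what it thought, what it did."""
    observation: str
    thought: str
    action: str


def refine_thought(step: Step) -> str:
    """Stand-in for GTR's automated corrector (in the paper, a model
    that evaluates and refines the reasoning). Hypothetical rule:
    flag thoughts that do not reference the current observation."""
    if step.observation not in step.thought:
        return f"Given {step.observation!r}: {step.thought}"
    return step.thought


def gtr_step_loss(step: Step, outcome_reward: float,
                  alpha: float = 0.5) -> float:
    """Sketch of a joint signal: the usual outcome reward plus a
    process-guidance penalty when the thought needed correction."""
    corrected = refine_thought(step)
    guidance_penalty = 0.0 if corrected == step.thought else 1.0
    return -outcome_reward + alpha * guidance_penalty
```

Because the guidance term is computed automatically at every step, no per-step human annotation is required, which is what makes the approach scalable.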

The experiments conducted show that GTR significantly improves the performance and generalization ability of the LLaVA-7b model in various visual environments. Compared to state-of-the-art models, LLaVA-7b with GTR achieved a three- to five-fold higher success rate on the tasks – with a significantly smaller model size. These results highlight the potential of GTR to advance the development of robust and powerful VLMs for complex tasks in visual environments.

The implications of this research are far-reaching. By improving the reasoning abilities of VLMs, they could in the future handle complex tasks in real-world environments, for example in robotics, navigation, or human-computer interaction. The development of efficient and scalable training methods like GTR is therefore a crucial step towards broader application of AI in our everyday lives.

Bibliography:
- https://arxiv.org/abs/2503.08525
- https://arxiv.org/html/2503.08525v1
- http://paperreading.club/page?id=291069
- https://iclr.cc/virtual/2025/papers.html
- http://paperreading.club/category?cate=Action