Visual Reinforcement Fine-Tuning Improves Performance of Large Vision-Language Models

Visual Reinforcement Fine-Tuning: A New Approach for Fine-tuning Large Vision-Language Models
Fine-tuning large AI models is a crucial step in optimizing their performance for specific tasks. Traditionally, this is done through Supervised Fine-Tuning (SFT), which relies on large amounts of labeled data. A newer approach, Reinforcement Fine-Tuning (RFT), promises to reduce the need for such large datasets by learning from feedback on the model's responses. This approach has already shown success in the field of language models, but its application to multimodal domains that process both text and images has been less explored.
Recent research introduces Visual Reinforcement Fine-Tuning (Visual-RFT), a method that applies RFT to visual tasks. Visual-RFT has the Large Vision-Language Model (LVLM) generate multiple candidate responses for a given input. Each response contains both so-called reasoning tokens, which represent the model's thought process, and the final answer. Specialized, visually verifiable reward functions then score these responses, and the resulting rewards are used to update the model with a policy optimization algorithm such as Group Relative Policy Optimization (GRPO).
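To make the update step more concrete, the following Python sketch shows a simplified GRPO-style objective for one group of sampled responses. It assumes response-level log-probabilities and scalar rewards are already available (real GRPO operates at token level), and the names grpo_loss, clip_eps, and kl_coef are illustrative choices rather than the paper's implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, logp_ref=None, clip_eps=0.2, kl_coef=0.04):
    """Illustrative GRPO-style objective for one group of sampled responses.

    logp_new / logp_old / logp_ref: summed log-probabilities of each response
    under the current, sampling (old), and frozen reference policy.
    rewards: one verifiable scalar reward per sampled response.
    """
    # Group-relative advantage: normalize rewards within the group instead of
    # training a separate value/critic model.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped policy-gradient term.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    loss = -torch.min(unclipped, clipped).mean()

    if logp_ref is not None:
        # Non-negative KL estimate that keeps the policy close to the reference model.
        kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
        loss = loss + kl_coef * kl.mean()
    return loss

# Toy usage: four sampled responses for one image/question pair.
rewards = torch.tensor([0.8, 0.2, 0.5, 0.9])            # verifiable rewards, e.g. IoU-based
logp_old = torch.tensor([-12.0, -15.0, -13.5, -11.0])   # log-probs under the sampling policy
logp_new = (logp_old + 0.1 * torch.randn(4)).requires_grad_()
print(grpo_loss(logp_new, logp_old, rewards, logp_ref=logp_old))
```

Because the advantage is computed relative to the other responses in the group, responses whose verifiable reward is above the group average are reinforced and below-average ones are suppressed, without needing a learned critic.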
The key to the success of Visual-RFT lies in the design of the reward functions, which are tailored to the specific perception task. For object detection, for example, the Intersection over Union (IoU) is used as the basis of the reward: it measures the overlap between the predicted bounding box and the ground-truth bounding box of an object.
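As a concrete illustration, the sketch below computes a simple IoU-based reward from predicted and ground-truth boxes. The helper names (iou, detection_reward) and the greedy best-match averaging are assumptions made for this example, not the paper's exact reward definition, which may additionally account for confidence scores and output format.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_boxes, gt_boxes):
    """Hypothetical reward: average best IoU over the ground-truth boxes.

    An empty prediction scores 0; a fuller reward would also penalize
    spurious false-positive boxes.
    """
    if not gt_boxes:
        return 0.0
    return sum(max((iou(g, p) for p in pred_boxes), default=0.0) for g in gt_boxes) / len(gt_boxes)

# Example: one predicted box partially overlapping the ground truth.
print(detection_reward([(10, 10, 50, 50)], [(20, 20, 60, 60)]))  # ~0.39
```

Because IoU can be computed directly from the model's output and the annotation, the reward is verifiable: no learned reward model is needed, which is what makes the approach viable with very little data.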
Experimental results on various benchmarks, including fine-grained image classification, few-shot object detection, reasoning grounding, and open-vocabulary object detection, demonstrate the effectiveness of Visual-RFT. Compared to SFT, Visual-RFT achieves compelling results, especially with limited data. In fine-grained image classification, for instance, it improves accuracy by 24.3% over the baseline with only about 100 training examples. In few-shot object detection, Visual-RFT also surpasses the baseline, with gains of 21.9% in the two-shot setting on COCO and 15.4% on LVIS.
These results suggest that Visual-RFT is a promising approach for fine-tuning LVLMs. The method enables data-efficient and reward-driven adaptation to specific tasks and improves both the reasoning ability and adaptability of the models. This opens up new possibilities for the use of LVLMs in areas where large amounts of labeled data are difficult to obtain.
The development of Visual-RFT sits at the intersection of ongoing research in reinforcement learning and the growing importance of multimodality in AI. By combining these two areas, Visual-RFT opens up new avenues for optimizing AI models and helps push the boundaries of what is possible in artificial intelligence.
For companies like Mindverse, which specialize in the development of AI solutions, approaches like Visual-RFT offer exciting opportunities. The data-efficient nature of the method could significantly simplify and accelerate the development of customer-specific AI models, such as chatbots, voicebots, or AI search engines. This underscores the potential of Visual-RFT to significantly influence the development and application of AI in practice.
Bibliography:
https://huggingface.co/papers/2503.01785
https://huggingface.co/papers?ref=lorcandempsey.net
https://www.arxiv.org/pdf/2502.15214
https://www.researchgate.net/scientific-contributions/Yuhui-Chen-2293988207
https://arxiv.org/html/2502.05450v1
https://the-decoder.de/openai-stellt-neue-finetuning-methode-fuer-individuelle-experten-ki-modelle-vor/
https://www.linkedin.com/posts/andrew-iain-jardine_supervised-fine-tuning-is-dead-long-live-activity-7290723366641012738-HSGM
https://medium.com/nlplanet/openai-to-release-finetuning-for-reasoning-models-weekly-ai-newsletter-december-9th-2024-3f13a537b698
https://x.com/travisaddair
https://www.linkedin.com/posts/awesomistan_announcing-new-fine-tuning-capabilities-with-activity-7278167704375599105-8Ro4