Reinforcement Learning Improves Generalization in Foundation Models

Comparative Study: Generalization and Memorization in Foundation Models through Supervised Fine-Tuning and Reinforcement Learning
The advancement of foundation models through post-training is a central topic in current AI research. Two of the most common procedures are Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). While both methods can improve model performance, their respective roles in improving generalization are not yet fully understood. A new study investigates how SFT and RL differ with respect to generalization and memorization, particularly when models face text-based and visual variations of their training tasks.
The researchers introduce GeneralPoints, an arithmetic-reasoning card game, and use V-IRL, a real-world navigation environment. With these two scenarios, one text-based and one visual, they examine how well models trained with SFT and RL generalize to unseen variants. The results show that RL, especially when trained with outcome-based rewards, generalizes better to both unseen rule-based text variants and unseen visual variants. SFT, in contrast, tends to memorize the training data and struggles in out-of-distribution scenarios.
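To make the setup concrete, the following is a minimal sketch (not the authors' code) of a GeneralPoints-style outcome-based reward: the model emits an arithmetic expression over four card values, and the reward is 1 only if the expression uses each card value exactly once and evaluates to the target of 24. The specific rule variant shown here (face cards counted as 10 during training versus as 11/12/13 at evaluation) is an illustrative assumption meant to mirror the kind of unseen rule change on which RL-trained models are reported to generalize better.

```python
import ast
from collections import Counter

FACE_AS_TEN = {"J": 10, "Q": 10, "K": 10, "A": 1}    # assumed training rule: face cards count as 10
FACE_AS_RANK = {"J": 11, "Q": 12, "K": 13, "A": 1}   # assumed held-out rule variant

def card_values(cards, rule):
    """Map card symbols ('2'..'10', 'J', 'Q', 'K', 'A') to numbers under a given rule."""
    return [rule[c] if c in rule else int(c) for c in cards]

def numbers_in(expr):
    """Collect the integer literals appearing in the model's expression."""
    tree = ast.parse(expr, mode="eval")
    return [n.value for n in ast.walk(tree)
            if isinstance(n, ast.Constant) and isinstance(n.value, int)]

def outcome_reward(expr, cards, rule, target=24):
    """Binary outcome reward: 1.0 iff expr uses every card value exactly once and hits the target."""
    try:
        if Counter(numbers_in(expr)) != Counter(card_values(cards, rule)):
            return 0.0
        return 1.0 if eval(expr, {"__builtins__": {}}) == target else 0.0
    except Exception:
        return 0.0  # malformed output earns no reward

# Cards 5, 5, K, A: under the training rule (K=10, A=1) this answer scores 1.0;
# under the held-out rule (K=13) the same answer no longer matches the card values and scores 0.0.
print(outcome_reward("(10 - 5) * 5 - 1", ["5", "5", "K", "A"], FACE_AS_TEN))   # -> 1.0
print(outcome_reward("(10 - 5) * 5 - 1", ["5", "5", "K", "A"], FACE_AS_RANK))  # -> 0.0
```

Because the reward only checks the outcome rather than imitating reference solutions, a policy trained against it has to produce expressions that are actually correct under whatever rule is in force, which is the intuition behind the reported generalization advantage.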
Further analyses suggest that RL also improves the model's underlying visual recognition capabilities, which contributes to the better generalization in the visual domain. Despite RL's superior generalization, SFT remains an important component of effective RL training: it stabilizes the model's output format, which allows the subsequent RL stage to realize its performance gains. This synergistic relationship between SFT and RL underscores the importance of both methods for developing powerful AI models.
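Read as a recipe, this is a two-stage pipeline. The toy sketch below is only a schematic of that control flow under stated assumptions, not an actual training loop: ToyPolicy, FORMAT_DEMOS, and the reuse of outcome_reward and FACE_AS_TEN from the previous sketch are illustrative stand-ins for a foundation model, SFT data, and the environment's reward.

```python
import random

CARDS = ["5", "5", "K", "A"]
FORMAT_DEMOS = ["(10 - 5) * 5 - 1"]  # assumed SFT demonstrations in the expected answer format

class ToyPolicy:
    """Toy stand-in for a language-model policy over arithmetic expressions."""
    def __init__(self):
        self.outputs = ["the answer is twenty-four"]  # pre-SFT: free-form text that does not parse

    def sft(self, demos):
        # Stage 1 (SFT): imitate formatted demonstrations so that outputs parse at all.
        self.outputs = list(demos)

    def sample(self):
        return random.choice(self.outputs)

policy = ToyPolicy()
print(outcome_reward(policy.sample(), CARDS, FACE_AS_TEN))  # 0.0: unformatted output gets no reward
policy.sft(FORMAT_DEMOS)                                    # SFT stabilizes the output format
for step in range(3):                                       # Stage 2 (RL) with outcome-based rewards
    expr = policy.sample()
    reward = outcome_reward(expr, CARDS, FACE_AS_TEN)
    # A real RL stage would update the policy from (expr, reward) pairs here.
    print(f"step {step}: reward={reward}")
```

The point of the toy is only the ordering: without the SFT warm-up the outcome reward is almost always zero, so the RL stage has no signal to improve on.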
The Importance of Generalization for AI Models
The ability to generalize is crucial for the practical application of AI models: a model should not only master its training data but also handle unseen, yet related tasks. The study indicates that RL, particularly in combination with SFT, is a promising approach for promoting this ability. The results suggest that RL-based training enables models to learn underlying principles and relationships rather than merely relying on the specific examples in the training data.
Outlook and Future Research
The findings of this study open new perspectives for the further development of foundation models. Future research could focus on determining the optimal combination of SFT and RL and on further investigating the role of reward functions in RL training. A better understanding of these relationships could lead to even more powerful and robust AI models that can be used in a wide range of application areas. Developing models that can solve complex tasks across varied contexts is an important step toward truly intelligent systems.
Bibliography:
Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q. V., Levine, S., & Ma, Y. SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training.
ChatPaper. (n.d.). Retrieved from https://chatpaper.com/chatpaper/zh-CN/paper/103149
Anonymous. (2024). Title Suppressed Due to Double-Blind Review. Proceedings of the International Conference on Machine Learning.
ML Research. (n.d.). Retrieved from https://ml-research.github.io/
NeurIPS 2024. (n.d.). Retrieved from https://nips.cc/virtual/2024/papers.html
NeurIPS 2024 Datasets and Benchmarks. (n.d.). Retrieved from https://neurips.cc/virtual/2024/events/datasets-benchmarks-2024
Anonymous. (2024). Title Suppressed Due to Double-Blind Review. arXiv preprint arXiv:2403.10131v1.
ICLR 2024 Oral Presentations. (n.d.). Retrieved from https://iclr.cc/virtual/2024/events/oral
Trustworthy AI Group. (n.d.). Adversarial Examples Papers. Retrieved from https://github.com/Trustworthy-AI-Group/Adversarial_Examples_Papers
OpenReview. (n.d.). Retrieved from https://openreview.net/attachment?id=w3iM4WLuvy&name=pdf