ReLU-Based Preference Optimization (RePO) Improves Large Language Model Performance

Optimizing Large Language Models: ReLU-based Preference Optimization (RePO) as a Promising Approach
Adapting large language models (LLMs) to human preferences is crucial for their successful deployment in practice. Existing methods such as Reinforcement Learning from Human Feedback (RLHF) face challenges in computational cost and training stability. Direct Preference Optimization (DPO) established an offline paradigm with a single hyperparameter (beta), but subsequent methods such as Simple Preference Optimization (SimPO) reintroduce tuning complexity with two hyperparameters (beta and gamma).
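For orientation, here is a minimal PyTorch sketch of the two baseline objectives as they are commonly written; the function and argument names are illustrative, not the authors' code. DPO anchors its margin to a frozen reference model, while SimPO works directly on length-normalized policy log probabilities and shifts them by a target margin gamma:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # DPO: logistic loss on the implicit reward margin, anchored to a frozen reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()

def simpo_loss(chosen_avg_logp, rejected_avg_logp, beta=2.0, gamma=0.5):
    # SimPO: reference-free, length-normalized margin, shifted by a target margin gamma.
    margin = chosen_avg_logp - rejected_avg_logp
    return -F.logsigmoid(beta * (margin - gamma)).mean()
```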
A new algorithm, ReLU-based Preference Optimization (RePO), now promises a more efficient solution. RePO simplifies the optimization process through two key changes: first, it retains SimPO's reference-free margin but eliminates beta via gradient analysis; second, it uses a ReLU-based max-margin loss that automatically filters out trivial pairs.
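A minimal sketch of what such a ReLU-based objective looks like, assuming SimPO-style reference-free margins (an illustration, not the paper's reference implementation):

```python
import torch.nn.functional as F

def repo_loss(chosen_avg_logp, rejected_avg_logp, gamma=0.5):
    # ReLU-based max-margin (hinge) loss on the reference-free margin; beta is gone.
    margin = chosen_avg_logp - rejected_avg_logp
    # Pairs whose margin already exceeds gamma contribute zero loss and zero gradient,
    # so "trivial" pairs are filtered out automatically.
    return F.relu(gamma - margin).mean()
```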
Theoretically, RePO can be characterized as a limiting case of SimPO as beta approaches infinity, in which the logistic weighting collapses to a binary threshold and the objective forms a convex hull of the 0-1 loss. This simplifies optimization and reduces computational cost.
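A quick numerical illustration of this limit (not taken from the paper): dividing the SimPO loss by beta gives softplus(beta * (gamma - margin)) / beta, which converges to the hinge ReLU(gamma - margin) as beta grows:

```python
import torch
import torch.nn.functional as F

margin = torch.linspace(-1.0, 1.0, 5)   # illustrative reward margins
gamma = 0.5
hinge = F.relu(gamma - margin)          # RePO-style hinge loss

for beta in (1.0, 10.0, 100.0):
    # SimPO loss / beta: -log(sigmoid(beta*(margin - gamma))) / beta = softplus(beta*(gamma - margin)) / beta
    scaled_simpo = F.softplus(beta * (gamma - margin)) / beta
    print(f"beta={beta:>6}: max gap to hinge = {(scaled_simpo - hinge).abs().max().item():.4f}")
# The gap shrinks toward zero as beta grows: the logistic weighting collapses into the binary/hinge regime.
```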
Empirical Results and Advantages of RePO
Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms both DPO and SimPO across various base models. A particular advantage is that RePO requires tuning only a single hyperparameter, the target margin gamma, which significantly simplifies the optimization process and makes RePO an attractive alternative to existing methods.
The development of RePO represents an important step in the optimization of large language models. By streamlining training and improving performance over existing methods, RePO helps make LLMs more usable in practical applications, and the reduction to a single hyperparameter allows for more efficient adaptation to human preferences.
Outlook and Significance for the Future of AI
Research in the field of LLMs is progressing rapidly. Methods like RePO help overcome the challenges of optimizing these models and pave the way for more powerful and reliable AI systems. Improved alignment with human preferences is crucial for developing AI systems that can solve complex tasks and support human-like interaction. Future research will show how RePO and similar approaches can be developed further to enhance LLM performance and fully realize their potential.
Bibliography:
arxiv.org/html/2503.07426v1
paperreading.club/page?id=290452
chatpaper.com/chatpaper/fr?id=5&date=1741622400&page=1
chatpaper.com/chatpaper/zh-CN?id=5&date=1741622400&page=1
arxiv.org/abs/2402.10958
www.sciencedirect.com/science/article/pii/S0957417424007711
www.researchgate.net/publication/323956667_Deep_Learning_using_Rectified_Linear_Units_ReLU
www.sciencedirect.com/science/article/pii/S0893608023005051
icml.cc/Downloads/2024
www.researchgate.net/publication/362948739_RELU-Function_and_Derived_Function_Review