TBA: Scalable Reinforcement Learning for Large Language Models

Efficient Reinforcement Learning for Large Language Models: Trajectory Balance with Asynchrony (TBA)

Reinforcement learning (RL) has established itself as an essential component of large language model (LLM) post-training. It lets LLMs optimize their behavior through interaction with an environment and thereby improve their performance on a variety of tasks. However, the on-policy RL algorithms commonly used for LLM post-training scale poorly and are difficult to combine with experience replay buffers. Such buffers store experiences from past interactions and can be filled asynchronously by distributed actors, improving the exploration of model behavior and broadening the pool of training data. A new approach called Trajectory Balance with Asynchrony (TBA) addresses these challenges and enables massively scalable RL for LLMs.

TBA uses computational resources efficiently by dedicating a larger share of compute to exploration, i.e., the search for new strategies. Distributed actors continuously generate off-policy data, which is collected in a central replay buffer. In parallel, a training node samples data from this buffer, selected by reward or recency, and updates the LLM's policy with the Trajectory Balance (TB) objective. TB, originally developed for GFlowNets, is an RL objective that encourages diversity in the learned strategies.
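To make the objective more concrete, here is a minimal sketch of a TB-style loss for autoregressive generation with a single terminal reward, where each response has exactly one generation path and the balance condition reduces to matching log Z plus the sequence log-probability against the log-reward. The function name `tb_loss`, the placeholder tensors, and the scalar log-partition estimate are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def tb_loss(logprobs: torch.Tensor,     # (batch, seq_len) per-token log p_theta(y_t | x, y_<t)
            mask: torch.Tensor,         # (batch, seq_len) 1 for generated tokens, 0 for padding
            log_reward: torch.Tensor,   # (batch,) log R(x, y), e.g. a scaled reward-model score
            log_z: torch.Tensor) -> torch.Tensor:   # learned scalar estimate of log Z
    """Squared trajectory-balance residual: (log Z + log p_theta(y|x) - log R(x, y))^2."""
    traj_logprob = (logprobs * mask).sum(dim=-1)   # log-probability of the whole response
    residual = log_z + traj_logprob - log_reward   # the balance condition drives this toward 0
    return (residual ** 2).mean()

# Illustrative usage with placeholder tensors standing in for real model outputs.
batch, seq_len = 4, 16
logprobs = -torch.rand(batch, seq_len)            # fake per-token log-probabilities
mask = torch.ones(batch, seq_len)
log_reward = torch.randn(batch)
log_z = torch.zeros(1, requires_grad=True)        # trainable log-partition estimate
print(tb_loss(logprobs, mask, log_reward, log_z))
```

Because the residual is squared, the loss can be driven down by off-policy samples as well, which is what makes it compatible with a replay buffer filled by asynchronous actors.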

This asynchronous approach decouples training from exploration and speeds up the training process considerably: compared to conventional methods, TBA can reduce training time by a factor of four or more. At the same time, training benefits from the greater diversity of data in the replay buffer, made possible by large-scale off-policy collection. TBA's scalable exploration also proves particularly advantageous in sparse-reward settings, where traditional methods often struggle to find good strategies.
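The decoupling itself can be pictured with the following thread-based skeleton, in which several actor workers keep filling a shared buffer while a single trainer draws batches at its own pace. This is a deliberately simplified single-machine sketch; in TBA the actors and the trainer are distributed nodes, and the names and placeholder rollouts used here are assumptions for the example.

```python
import random
import threading
import time
from collections import deque

# Shared replay buffer; a lock keeps concurrent appends and reads consistent.
buffer = deque(maxlen=10_000)
lock = threading.Lock()

def actor(actor_id: int, n_rollouts: int) -> None:
    """Stand-in for a searcher node: generate off-policy rollouts and push them to the buffer."""
    for step in range(n_rollouts):
        response = f"actor{actor_id}-rollout{step}"   # placeholder for an LLM sample
        reward = random.random()                      # placeholder for a reward score
        with lock:
            buffer.append({"response": response, "reward": reward, "step": step})

def trainer(n_updates: int, batch_size: int = 8) -> None:
    """Stand-in for the training node: draw batches without waiting for the actors to finish."""
    for _ in range(n_updates):
        with lock:
            batch = random.sample(list(buffer), min(batch_size, len(buffer)))
        # A real trainer would compute the TB loss on `batch` and take a gradient step here.
        time.sleep(0.01)

threads = [threading.Thread(target=actor, args=(i, 200)) for i in range(4)]
threads.append(threading.Thread(target=trainer, args=(100,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
```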

Advantages of TBA at a Glance

TBA offers three key advantages compared to existing RL methods for LLMs:

1. **Decoupling of Training and Exploration:** The asynchronous architecture allows training to occur in parallel with data generation, leading to a significant acceleration of the training process.
2. **Improved Diversity:** The use of a central replay buffer, filled with off-policy data, enables a greater variety of training data and promotes the exploration of different strategies (a sampling sketch follows this list).
3. **Scalable Exploration:** TBA is particularly suitable for environments with sparse rewards, as exploration can be efficiently scaled by the distributed actors.
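How such a buffer can be queried is sketched below with a small helper that mixes recency-based and reward-based selection, the two criteria mentioned above. The 50/50 split, the record fields, and the helper name are assumptions for the example rather than details taken from the paper.

```python
import random
from typing import Dict, List

def sample_batch(buffer: List[Dict], batch_size: int, recent_fraction: float = 0.5) -> List[Dict]:
    """Select a batch that mixes the most recent records with the highest-reward records."""
    n_recent = int(batch_size * recent_fraction)
    by_recency = sorted(buffer, key=lambda r: r["step"], reverse=True)[:n_recent]
    by_reward = sorted(buffer, key=lambda r: r["reward"], reverse=True)[:batch_size - n_recent]
    return by_recency + by_reward   # a record may appear twice if it is both recent and high-reward

# Illustrative usage with synthetic records shaped like the buffer entries in the sketch above.
records = [{"response": f"sample-{i}", "reward": random.random(), "step": i} for i in range(100)]
print(len(sample_batch(records, batch_size=8)))
```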

Applications and Results

The effectiveness of TBA has been evaluated on various LLM post-training tasks, including mathematical reasoning, preference tuning, and automated red-teaming. In all of these areas, TBA delivered both speed and performance improvements over established baselines. For example, in a setup comparable to VinePPO on the GSM8K dataset, TBA achieved an accuracy gain of 1.2% alongside a roughly 50-fold acceleration of RL training.

The results suggest that TBA is a promising approach for scalable RL of LLMs and has the potential to drive the development of more powerful and efficient language models. The decoupling of training and exploration, the improved diversity of training data, and the scalable exploration make TBA an attractive alternative to conventional on-policy RL methods.