AI Agents Enhanced with Segment-Level Preference Optimization

AI-Driven Social Agents: A New Approach to Dialogue Optimization
The development of social agents powered by large language models (LLMs) has made significant strides in recent years. These agents can simulate human social behaviors and be deployed in various interactive scenarios. Despite their ability to conduct simple conversations and simulate basic social interactions, LLMs encounter limitations in more complex, goal-oriented social dialogues, such as negotiations or collaborations.
A promising approach to improving the performance of LLMs in such scenarios is Direct Preference Optimization (DPO). DPO methods aim to align the behavior of LLMs with human preferences. Existing DPO-based approaches for multi-turn interactions can be categorized into two types: turn-level and session-level methods. Turn-level DPO focuses on individual conversation turns, while session-level DPO considers entire conversations. However, both approaches have limitations. Turn-level DPO is often too granular and neglects the overall context of the conversation. Session-level DPO, on the other hand, is too coarse and can lead to undesirable noise in the training process by incorporating irrelevant conversation turns.
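To make the distinction concrete, the sketch below shows the standard pairwise DPO objective that both variants instantiate; the only difference is what counts as a "response" (a single turn for turn-level DPO, a whole session for session-level DPO). The function name, tensor shapes, and the toy usage are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the standard DPO loss underlying turn-level and
# session-level methods (illustrative, not the authors' implementation).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss over a batch of preference pairs.

    Each *_logps tensor holds the summed log-probability of a response
    (one turn, or one whole session) under the policy or the frozen
    reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the preferred and dispreferred response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 pairs.
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps).item())
```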
SDPO: Segment-Level Direct Preference Optimization
To overcome the weaknesses of existing DPO methods, a new approach has been developed: Segment-Level Direct Preference Optimization (SDPO). SDPO focuses on specific, relevant segments within an interaction to optimize multi-turn agent behavior while minimizing training noise.
SDPO first locates the flawed turn in a negative session. The interaction history before this turn is then used to sample multiple positive sessions. Starting from the first diverging turn, the key segment that leads to the better outcome is identified in the chosen positive session. A preference pair is then formed by extracting a segment of the same length from the negative session, and a tailored DPO loss is computed only over the turns within these segments, as sketched below.
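The following sketch shows how such segment pairs might be formed and scored, under assumed data structures: a session is a list of turns, each a dict with a "speaker" field, and `policy_lp` / `ref_lp` map a turn to its log-probability. None of these names come from the paper; this is a hedged illustration of the idea, not the authors' code.

```python
# Hedged sketch of segment-level preference scoring (illustrative assumptions).
import torch
import torch.nn.functional as F

def segment_logp(session, start, length, logp_fn):
    """Summed log-probability of the agent's turns inside the segment."""
    seg = session[start:start + length]
    vals = [logp_fn(t) for t in seg if t["speaker"] == "agent"]
    return torch.stack(vals).sum() if vals else torch.tensor(0.0)

def sdpo_loss(pos_session, neg_session, diverge_idx, key_end,
              policy_lp, ref_lp, beta=0.1):
    # Key segment in the positive session: from the first diverging turn up to
    # the turn credited with the better outcome.
    seg_len = key_end - diverge_idx
    # Pair it with the same-length segment of the negative session, starting at
    # the same index (both sessions share the history before diverge_idx).
    pol_pos = segment_logp(pos_session, diverge_idx, seg_len, policy_lp)
    pol_neg = segment_logp(neg_session, diverge_idx, seg_len, policy_lp)
    ref_pos = segment_logp(pos_session, diverge_idx, seg_len, ref_lp)
    ref_neg = segment_logp(neg_session, diverge_idx, seg_len, ref_lp)
    margin = beta * ((pol_pos - ref_pos) - (pol_neg - ref_neg))
    return -F.logsigmoid(margin)

# Toy usage: 6-turn sessions that share the first 2 turns and diverge at index 2.
def fake_lp(_turn):  # stand-in for a model's per-turn log-probability
    return torch.randn(())

pos = [{"speaker": s} for s in ["agent", "partner"] * 3]
neg = [{"speaker": s} for s in ["agent", "partner"] * 3]
print(sdpo_loss(pos, neg, diverge_idx=2, key_end=6,
                policy_lp=fake_lp, ref_lp=fake_lp).item())
```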
By restricting the loss to relevant segments, SDPO avoids the training noise introduced by irrelevant turns. Because positive sessions are sampled from the same interaction history, the action space of the conversational partner is also narrowed, which increases the likelihood that the generated positive sessions contain the desired agent behavior patterns.
Evaluation and Results
The effectiveness of SDPO was evaluated on SOTOPIA, an open, interactive benchmark for social intelligence. The results show that SDPO-trained agents outperform both existing DPO methods and proprietary LLMs such as GPT-4o, highlighting the potential of SDPO to improve the social intelligence of LLM-based agents.
SDPO offers a flexible, general method for dialogue optimization: the granularity of the optimization, from a single turn up to an entire session, can be adapted to the data at hand. Although SDPO was developed primarily to improve the social intelligence of agents, the approach can also be applied to other multi-turn settings to enhance agent capabilities across domains.
Future Research
Research on AI-driven social agents is moving quickly. Further work is needed to fully exploit the potential of SDPO and related methods; developing more robust evaluation metrics and extending the approach to additional domains are important directions for future research.