World Models Enhance Embodied Task Planning with Dual Preference Optimization

The ability to plan and execute complex tasks in real-world environments is a central research area in Artificial Intelligence. Large Vision-Language Models (LVLMs) have made significant progress in recent years in areas such as natural language understanding and visual perception. These advances open up new possibilities for their use in embodied task planning, where AI agents operate in simulated or real environments and must solve tasks based on visual and linguistic information. Despite this potential, LVLMs still face challenges in this setting, particularly with dependency constraints between actions and with planning efficiency.

A promising way to address these challenges is to integrate world models into the learning process. Existing planning approaches typically either optimize action selection alone or use world models only at inference time; the benefit of learning a world model as a way to improve planning capabilities is often overlooked. A recent research paper proposes a different approach: Dual Preference Optimization (D²PO).

D²PO: A New Approach to Planning Optimization

D²PO is a learning framework that jointly optimizes state-transition prediction and action selection through preference learning. By learning a world model, an LVLM can better capture the dynamics of its environment and therefore produce more effective plans. In contrast to conventional methods, which often rely on time-consuming manual annotation of trajectories and preference data, D²PO takes an automated approach: a tree-search mechanism explores the environment through trial and error, so that trajectories and step-wise preference data can be collected without human intervention. A minimal sketch of how such a dual preference objective could be combined is given below.
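The article describes the objective only at a high level, so the following is a minimal, hypothetical PyTorch sketch of a dual preference objective: a standard DPO-style loss applied to two kinds of preference pairs, one over predicted next states (world modeling) and one over selected actions (action selection). The function names, the dictionary layout of the log-probabilities, and the weighting factor lambda_world are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def dpo_term(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO-style preference loss on summed sequence log-probabilities.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def dual_preference_loss(policy_logps, ref_logps, lambda_world=1.0, beta=0.1):
    # Hypothetical combination of the two preference objectives described above:
    #   "action_*" pairs: preferred vs. dispreferred actions (action selection)
    #   "state_*"  pairs: preferred vs. dispreferred next-state predictions (world modeling)
    # policy_logps / ref_logps hold summed log-probabilities of the chosen and rejected
    # completions under the trained model and a frozen reference model.
    action_loss = dpo_term(
        policy_logps["action_chosen"], policy_logps["action_rejected"],
        ref_logps["action_chosen"], ref_logps["action_rejected"], beta,
    )
    state_loss = dpo_term(
        policy_logps["state_chosen"], policy_logps["state_rejected"],
        ref_logps["state_chosen"], ref_logps["state_rejected"], beta,
    )
    return action_loss + lambda_world * state_loss

# Toy usage with random numbers standing in for per-sequence log-probabilities.
keys = ["action_chosen", "action_rejected", "state_chosen", "state_rejected"]
policy_logps = {k: torch.randn(4) for k in keys}
ref_logps = {k: torch.randn(4) for k in keys}
print(dual_preference_loss(policy_logps, ref_logps).item())

In such a setup, the step-wise preference pairs produced by the tree search (for example, successful versus failed branches) would populate the "chosen" and "rejected" entries, and lambda_world would control how strongly the world-modeling signal influences training.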

Experimental Results and Outlook

To evaluate the effectiveness of D²PO, extensive experiments were conducted on VoTa-Bench, a benchmark for embodied task planning. The results show that D²PO, applied to various LVLMs such as Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B), achieves significantly better results than existing methods and even outperforms GPT-4o. D²PO not only reaches higher task success rates but also generates more efficient execution paths. This suggests that learning a world model through D²PO substantially improves the planning capabilities of LVLMs.

The development of D²PO represents an important step towards more robust and efficient embodied task planning. The ability to collect preference data automatically and to integrate state prediction with action planning opens up new possibilities for using LVLMs in complex, real-world scenarios. Future research could focus on extending D²PO to more complex environments and on investigating how the approach scales to even larger models. Integrating world models into planning promises to further enhance the performance of AI agents in real-world applications.

Bibliography:
- https://huggingface.co/papers/2503.10480
- https://huggingface.co/papers
- https://chatpaper.com/chatpaper/fr?id=3&date=1741881600&page=1
- https://arxiv.org/abs/2307.01848
- https://arxiv.org/html/2502.11221v1
- https://www.researchgate.net/publication/377425973_LLM-Planner_Few-Shot_Grounded_Planning_for_Embodied_Agents_with_Large_Language_Models
- https://iclr.cc/Downloads/2024
- https://jmlr.org/tmlr/papers/
- https://aclanthology.org/2024.emnlp-main.367.pdf
- https://openreview.net/forum?id=S1Bv3068Xt