Self-Supervised Skill Discovery in Open Worlds from Unsegmented Demonstrations

Self-Supervised Learning for Skill Discovery in Open Worlds

Developing AI agents capable of handling complex tasks in open worlds presents a significant challenge. A promising approach involves teaching agents to learn basic skills and then combine them into more complex action sequences. Online demonstration videos, such as those from video games, offer immense potential for training such agents. However, these videos are typically long and unsegmented, making the identification and extraction of individual skills difficult.

Traditional methods for segmenting demonstration videos often rely on time-consuming manual annotation or complex sampling procedures. Recent research instead takes a self-supervised approach to dividing long videos into semantically meaningful, skill-consistent segments. The resulting algorithm, "Skill Boundary Detection" (SBD), is inspired by event segmentation theory from human cognition.

How Skill Boundary Detection (SBD) Works

SBD detects skill boundaries in a video by using the prediction errors of a pre-trained, unconditional action prediction model. The underlying assumption is that a significant increase in prediction error indicates a change in the skill being executed. Simply put: if the model can no longer predict the next action in the video well, a new skill has likely begun.
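The boundary rule described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the per-step prediction errors have already been computed by some action prediction model and are passed in as a plain list. The function names `detect_boundaries` and `split_segments`, and the threshold parameter `gap`, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def detect_boundaries(errors: Sequence[float], gap: float) -> List[int]:
    """Return every step t whose error rises by more than `gap` over step t-1.

    `errors` stands in for the per-step prediction error of a pre-trained
    action prediction model (assumed to be given). `gap` is a hypothetical
    threshold for what counts as a "significant increase".
    """
    boundaries = []
    for t in range(1, len(errors)):
        if errors[t] - errors[t - 1] > gap:
            boundaries.append(t)
    return boundaries


def split_segments(n_steps: int, boundaries: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn boundary indices into half-open (start, end) step ranges."""
    cuts = [0, *boundaries, n_steps]
    return [(start, end) for start, end in zip(cuts, cuts[1:]) if end > start]


# Toy error trace with two sharp jumps (made-up numbers, not real data).
toy_errors = [0.10, 0.12, 0.11, 0.90, 0.30, 0.25, 1.20, 0.40]
toy_boundaries = detect_boundaries(toy_errors, gap=0.5)
toy_segments = split_segments(len(toy_errors), toy_boundaries)
print(toy_boundaries, toy_segments)
```

In this toy trace, each sharp rise in error starts a new segment, while the drop back to a low error afterwards does not. How the threshold is chosen in practice is not covered by this sketch.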

Evaluation in Minecraft

The effectiveness of SBD was evaluated in Minecraft, an open-world simulator with an abundance of online gameplay videos. Training on SBD-generated segments improved the average performance of two conditioned policies by 63.7% and 52.1%, respectively, on short-term atomic skill tasks. The corresponding hierarchical agents, which combine these skills for more complex, long-horizon tasks, improved by 11.3% and 20.8%.

Potential for Instruction-Following Agents

The results suggest that SBD is a promising approach to harness the potential of diverse online video resources, such as YouTube, for training instruction-following agents. These agents could learn to execute complex instructions in an open world based on the segmented videos.

Outlook and Significance for AI Development

Methods like SBD are a step towards more robust and flexible AI agents. The ability to learn from unsegmented demonstrations opens up new possibilities for training agents in complex, open worlds, and could eventually benefit application areas ranging from robotics and autonomous vehicles to virtual assistants. Research in this field is moving quickly, and further developments can be expected.

Bibliography:

Deng, J., Wang, Z., Cai, S., Liu, A., & Liang, Y. (2025). Open-World Skill Discovery from Unsegmented Demonstrations. arXiv preprint arXiv:2503.10684.

Zhu, H., Deng, J., Cai, S., Wang, Z., & Liang, Y. (2022). Bottom-Up Skill Discovery From Unsegmented Demonstrations for Long-Horizon Robot Manipulation. Robotics: Science and Systems.