Being-0: Hierarchical AI Agent Controls Humanoid Robot

Humanoid Robots: Being-0 Masters Complex Tasks

The development of autonomous robots with human-level capabilities in real-world environments is a central goal of robotics research. A promising approach in this area is Being-0, a hierarchical agent framework that combines a foundation model (FM) with a modular skill library. Being-0 demonstrates how complex, multi-step tasks requiring both navigation and manipulation can be solved effectively.

The Architecture of Being-0

Being-0 is built on a three-level architecture. At the top sits the FM, which handles the high-level cognitive work: understanding instructions, planning tasks, and reasoning. At the bottom sits the skill library, which provides stable locomotion and dexterous manipulation skills for controlling the robot. Bridging the two is the so-called Connector module, powered by a lightweight vision-language model (VLM). The Connector translates the FM's language-based plans into executable skill commands and dynamically coordinates locomotion and manipulation to improve task completion.
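
To make the division of labor concrete, here is a minimal sketch of the three-level control flow. All class and method names (FoundationModel.plan, Connector.ground, SkillLibrary.execute) are illustrative assumptions rather than the authors' API, and the keyword-based grounding is only a stand-in for the VLM:

```python
from dataclasses import dataclass, field

@dataclass
class SkillCommand:
    name: str                           # e.g. "walk_to" or "grasp"
    params: dict = field(default_factory=dict)

class SkillLibrary:
    """Lowest level: stable locomotion and dexterous manipulation skills."""
    def execute(self, cmd: SkillCommand) -> bool:
        print(f"skill: {cmd.name}({cmd.params})")
        return True                     # real skills would report success/failure

class Connector:
    """Middle level: lightweight VLM grounding language steps in skills."""
    def ground(self, step: str, observation) -> SkillCommand:
        # A real Connector conditions on the camera image; this keyword
        # mapping only keeps the sketch runnable.
        if step.startswith("go to"):
            return SkillCommand("walk_to", {"target": step[len("go to "):]})
        return SkillCommand("grasp", {"object": step.split()[-1]})

class FoundationModel:
    """Top level: instruction understanding, planning, reasoning."""
    def plan(self, instruction: str) -> list[str]:
        # Stand-in for a (typically remote) FM call.
        return ["go to the kitchen counter", "grasp the red cup"]

def run_task(instruction: str) -> None:
    fm, connector, skills = FoundationModel(), Connector(), SkillLibrary()
    for step in fm.plan(instruction):
        cmd = connector.ground(step, observation=None)  # camera frame omitted
        if not skills.execute(cmd):
            break    # a full system would report the failure back to the FM

run_task("bring me the red cup")
```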

The Role of the Connector

The Connector is central to Being-0's efficiency and robustness. By grounding abstract plans in concrete actions, it enables the robot to carry out complex tasks in the real world. Its dynamic coordination of locomotion and manipulation reduces errors and raises task success rates. Another advantage of the Connector is that it accounts for the differing latencies of the individual modules, keeping task execution smooth.
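
One plausible reading of this latency handling is sketched below under stated assumptions: the slow, typically remote FM is queried once per plan, the fast onboard Connector amortizes that cost over many cheap skill steps, and only persistent failures escalate back to the FM. The timing constants and function names are hypothetical, not taken from the paper:

```python
import time

FM_LATENCY_S = 2.0       # assumed round trip to a remote FM
SKILL_LATENCY_S = 0.05   # assumed onboard grounding + control tick

def fm_plan(instruction: str) -> list[str]:
    """One slow, expensive FM call (simulated)."""
    time.sleep(FM_LATENCY_S)
    return ["go to shelf", "grasp cup", "go to table", "place cup"]

def connector_execute(step: str) -> bool:
    """Fast onboard grounding and skill execution (simulated)."""
    time.sleep(SKILL_LATENCY_S)
    print("executed:", step)
    return True

def run(instruction: str) -> None:
    start = time.perf_counter()
    for step in fm_plan(instruction):   # FM queried once...
        for _ in range(3):              # ...fast local retries hide its latency
            if connector_execute(step):
                break
        else:
            print("persistent failure; would escalate back to the FM")
            return
    print(f"done in {time.perf_counter() - start:.2f}s")

run("bring me the cup")
```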

Efficiency through Onboard Processing

A notable feature of Being-0 is that all components except the FM run on low-cost onboard computing hardware. This enables efficient real-time operation of a full-size humanoid robot equipped with dexterous hands and active vision. Onboard processing reduces reliance on external computing resources and increases the robot's autonomy.
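
The resulting compute split can be pictured as a simple placement map. The component names and rationales below are a hypothetical summary of the described setup, not a real configuration file from the project:

```python
# Hypothetical placement map reflecting the split described above:
# only the FM runs off-board; everything else fits on onboard compute.
DEPLOYMENT = {
    "foundation_model": ("cloud",   "large model; high-level planning tolerates latency"),
    "connector_vlm":    ("onboard", "lightweight VLM; needs low-latency grounding"),
    "skill_library":    ("onboard", "real-time locomotion and manipulation control"),
    "active_vision":    ("onboard", "camera control tied to head and body motion"),
}

for component, (placement, reason) in DEPLOYMENT.items():
    print(f"{component:17s} -> {placement:7s}  # {reason}")
```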

Experiments and Results

To demonstrate the capabilities of Being-0, the authors conducted extensive experiments in large indoor spaces. The results show that Being-0 successfully completes complex, multi-step tasks that demand challenging navigation and dexterous manipulation. The combination of FM, Connector, and skill library proves to be an effective strategy for developing autonomous robots.

Future Perspectives

Being-0 represents an important step toward humanoid robots with human-level capabilities in real-world environments. Its hierarchical architecture, integrating an FM, a VLM, and modular skills, offers a promising framework for future research in this area. Follow-up work could focus on improving the system's robustness and adaptability in dynamic environments, as well as on expanding the skill library with additional complex capabilities.

Bibliography

Yuan, H., Bai, Y., Fu, Y., Zhou, B., Feng, Y., Xu, X., Zhan, Y., Karlsson, B. F., & Lu, Z. (2025). Being-0: A humanoid robotic agent with vision-language models and modular skills. arXiv preprint arXiv:2503.12533.

Huang, W., Cheng, Y., & Slotine, J. J. E. (2023). Learning inverse kinematics for humanoid robots by trajectory optimization. In 2023 IEEE International Conference on Robotics and Automation (ICRA) (pp. 12274-12280). IEEE.

Toussaint, M. (2020). Differentiable physics and robotics. In Robotics: Science and Systems XVI.

Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., & Srinivas, A. (2024). RT-1: Robotics transformer for real-world control at scale. In Conference on Robot Learning. PMLR.

Sun, Y., Wang, Y., Zhan, W., & Xie, L. (2024). Multimodal prompt engineering for robotics: A survey. arXiv preprint arXiv:2405.14093.

Li, M., Zhou, Z., Du, Y., Zhao, H., & Wen, J. (2023). A survey on vision-language pre-trained models. arXiv preprint arXiv:2311.07226.

Gojayev, F. (2024). Awesome-vlm-architectures. https://github.com/gokayfem/awesome-vlm-architectures

Georgia Tech Robot Intelligence and Perception Lab. (2024). Awesome-LLM-Robotics. https://github.com/GT-RIPL/Awesome-LLM-Robotics

Lee, K., Laskin, M., Abbeel, P., & Srinivas, A. (2024). Robotics Transformer-X: Generalising RT-1 robotics transformer with tools, hands, and visual-language multi-task learning. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE.

Muratore, F., Mousavian, A., Toshev, A., & Agrawal, P. (2024, October). Learning multi-step manipulation tasks with world models and large language models. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE.

Huang, W., Cheng, Y., & Slotine, J. J. (2024). Adaptive control of underactuated humanoid robots with multi-objective task optimization. Autonomous Robots, 48, 1013-1030.