Aguvis: A New Vision-Based Approach to Cross-Platform GUI Automation

Automating tasks on graphical user interfaces (GUIs) is challenging because of the diversity of platforms and the visual variability of interfaces. Existing approaches, often built on text-based GUI representations, fall short in generalization, efficiency, and scalability. A new line of research pursues a purely visual approach to autonomous GUI agents that works across platforms.

Aguvis: A Unified Framework for Visual GUI Agents

Aguvis is a framework for autonomous GUI agents that relies solely on visual information. Instead of depending on textual descriptions of GUI elements, Aguvis operates directly on image data and grounds natural language instructions in visual elements. A unified action space provides cross-platform generalization, allowing Aguvis to handle GUI tasks on desktops, smartphones, and in web browsers.
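
To make the idea of a unified action space concrete, the following sketch defines a small, platform-agnostic action vocabulary with normalized screen coordinates. All names and types here are illustrative assumptions, not the actual Aguvis interface.

```python
from dataclasses import dataclass
from typing import Union

# Minimal sketch of a unified, platform-agnostic action space.
# Coordinates are normalized to [0, 1], so the same vocabulary applies
# to desktop, smartphone, and browser screenshots of any resolution.
# All names are illustrative; this is not the actual Aguvis API.

@dataclass
class Click:
    x: float  # horizontal position, normalized to screenshot width
    y: float  # vertical position, normalized to screenshot height

@dataclass
class Type:
    text: str  # text to enter into the currently focused element

@dataclass
class Scroll:
    dx: float  # horizontal scroll amount, normalized
    dy: float  # vertical scroll amount, normalized

Action = Union[Click, Type, Scroll]

def to_pixels(action: Action, width: int, height: int) -> Action:
    """Resolve normalized coordinates against a concrete screen size."""
    if isinstance(action, Click):
        return Click(x=action.x * width, y=action.y * height)
    return action

# Example: the same Click maps onto different devices without change.
tap = Click(x=0.5, y=0.9)
print(to_pixels(tap, 1920, 1080))  # desktop -> Click(x=960.0, y=972.0)
print(to_pixels(tap, 390, 844))    # smartphone -> Click(x=195.0, y=759.6)
```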

Planning and Reasoning: The Key to Autonomous Navigation

A central element of Aguvis is the integration of explicit planning and reasoning. By analyzing complex digital environments and planning sequences of actions, Aguvis can navigate GUIs and interact with them autonomously. This distinguishes Aguvis from previous approaches, which were often limited to reactive, single-step interactions.
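
A minimal sketch of such an observe-plan-act loop is shown below, assuming a vision-language model that produces an explicit reasoning trace alongside each action; the `model` and `executor` interfaces are hypothetical, not the Aguvis code.

```python
from typing import Any

def run_agent(instruction: str, model: Any, executor: Any,
              max_steps: int = 20) -> list[str]:
    """Observe -> plan -> act loop for a purely visual GUI agent (sketch).

    `model.plan_and_act` and `executor` are assumed interfaces: the model
    sees only raw pixels (no accessibility tree or HTML) plus its own
    history, and must verbalize a plan before committing to an action.
    """
    history: list[str] = []
    for _ in range(max_steps):
        screenshot = executor.capture_screenshot()  # raw pixels only
        # Explicit reasoning step: the model states what it intends to do
        # and why, then grounds that intent in a single concrete action.
        plan, action = model.plan_and_act(instruction, screenshot, history)
        if action is None:  # the model judges the task complete
            break
        executor.execute(action)
        history.append(plan)  # the plan feeds back into later steps
    return history
```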

Data Foundation and Training: Two-Phase Pipeline for Optimal Performance

To support this capability, a large-scale dataset of GUI agent trajectories was created that includes multimodal reasoning and grounding annotations. Training proceeds in two stages: the model first learns general GUI grounding, then planning and reasoning. This two-stage pipeline allows Aguvis to master both basic GUI interactions and complex, multi-step tasks.
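
The following sketch spells out the two stages as a training schedule. The stage split mirrors the description above, while the field names and data formats are assumptions made for illustration.

```python
# Illustrative two-stage training schedule. The stage split mirrors the
# paper's description; field names and data formats are assumptions.

TRAINING_STAGES = [
    {
        "stage": 1,
        "focus": "GUI grounding",
        # Single-step supervision: a screenshot and an instruction that
        # refers to one element, paired with the grounded action,
        # e.g. "open the settings menu" -> Click(x=0.93, y=0.04).
        "data": "screenshot + element instruction -> single action",
    },
    {
        "stage": 2,
        "focus": "planning and reasoning",
        # Multi-step trajectories augmented with intermediate reasoning,
        # trained on top of the stage-1 checkpoint so grounding skills
        # carry over to long-horizon tasks.
        "data": "screenshot + task + history -> reasoning trace + action",
    },
]

for stage in TRAINING_STAGES:
    print(f"Stage {stage['stage']}: {stage['focus']} ({stage['data']})")
```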

Evaluation and Results: Superiority over Existing Methods

Extensive experiments show that Aguvis outperforms previous state-of-the-art methods in both offline and real-world online scenarios. According to the authors, Aguvis is the first fully autonomous, purely vision-based GUI agent that completes tasks without collaborating with external closed-source models.

Open Source: Promoting Future Research

To promote further research in this area, all datasets, models, and training recipes for Aguvis are released as open source. This allows other researchers to build on these results and advance the development of autonomous GUI agents.

Potential and Outlook: Diverse Application Possibilities

The development of autonomous GUI agents like Aguvis opens up diverse application possibilities. From automating repetitive office tasks to supporting people with disabilities in using digital devices, the potential of this technology is enormous. Future research could focus on improving the robustness and adaptability of GUI agents in dynamic and unpredictable environments.

Bibliography

Xu, Y. et al. (2024). Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction. arXiv preprint arXiv:2412.04454.
OpenReview.net. Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction.
ranpox. (n.d.). Awesome-computer-use. GitHub repository.
ChatPaper. (2024). Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction.
Liu, X. et al. (2024). AutoGLM: Autonomous Foundation Agents for GUIs. arXiv preprint arXiv:2411.00820.
showlab. (n.d.). Awesome-GUI-Agent. GitHub repository.
Chen, W. et al. (2024). GUICourse: From General Vision Language Model to Versatile GUI Agent. arXiv preprint arXiv:2406.11317v1.
Del Grosso, A. et al. (2023). Engineering modularity in biochemical networks using compartmentalization and spatial patterning. ACS Synthetic Biology.
Sarkar, S. et al. (2021). A review of advances in the bio-inspired computation domain. European Biophysics Journal.