VGGT: A New Transformer-Based Approach to 3D Computer Vision

3D computer vision has made enormous progress in recent years. Traditionally, however, models have focused on individual tasks, such as estimating depth maps or camera parameters. The paper "VGGT: Visual Geometry Grounded Transformer" presents a more comprehensive approach. VGGT is a feed-forward neural network that infers all key 3D attributes of a scene in a single pass from one, a few, or even hundreds of views.

These attributes include extrinsic and intrinsic camera parameters, depth maps, point maps (per-pixel 3D points that together form a point cloud), and 3D point tracks. What sets VGGT apart is its efficiency and its holistic design. Previous methods often require costly post-processing with geometric optimization procedures, whereas VGGT delivers its results directly and within seconds. This opens up new possibilities for real-time applications in areas such as robotics, augmented reality, and autonomous driving.
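The attributes listed above are geometrically linked: given a depth map and the camera parameters of a view, a point cloud follows by unprojection. The following is a minimal sketch of that relationship in plain Python, not code from VGGT. It assumes a pinhole camera with intrinsics fx, fy, cx, cy and an extrinsic pose (R, t) that maps world coordinates to camera coordinates; the function name and the conventions are illustrative assumptions.

```python
def unproject_depth(depth, fx, fy, cx, cy, R, t):
    """Convert a depth map into 3D points in world coordinates.

    Assumptions (illustrative, not taken from the VGGT code base):
    - pinhole camera with focal lengths fx, fy and principal point cx, cy
    - extrinsics map world to camera: X_cam = R @ X_world + t
    - depth[v][u] is the z-coordinate of pixel (u, v) in the camera frame
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            # Pixel plus depth -> 3D point in the camera frame.
            x_cam = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            # Camera frame -> world frame: X_world = R^T @ (X_cam - t).
            diff = [x_cam[i] - t[i] for i in range(3)]
            x_world = tuple(
                sum(R[k][i] * diff[k] for k in range(3)) for i in range(3)
            )
            points.append(x_world)
    return points


# Usage: a 2x2 depth map seen by a camera at the world origin.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
cloud = unproject_depth(
    [[2.0, 2.0], [2.0, 2.0]],
    fx=1.0, fy=1.0, cx=0.5, cy=0.5,
    R=identity, t=(0.0, 0.0, 0.0),
)
```

A network that predicts depth, camera parameters, and point maps jointly can be checked against this relationship, since the three outputs describe the same geometry in different forms.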

Superior Performance and Versatile Applications

According to the paper, VGGT achieves state-of-the-art results in several 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. The reported comparisons with existing methods show VGGT to be both faster and more accurate. VGGT can also serve as a pre-trained feature backbone for downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Initial results suggest that it yields significant improvements in these areas as well.

Architecture and Functionality

VGGT is based on the Transformer architecture, which has already proven itself in other areas of artificial intelligence. Trained on data with 3D annotations, the network learns to capture and evaluate the spatial relationships between the different views of a scene. Its layers alternate between attention within each individual frame and attention across all frames, so it processes both local and global information, resulting in a robust and accurate 3D reconstruction.
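One way to picture the processing of global and local information is to alternate two attention steps: a local step in which the tokens of each view attend only to each other, and a global step in which the tokens of all views attend jointly. The sketch below is a toy illustration in plain Python and not the VGGT implementation. It uses single-head attention without learned projections (Q = K = V), and all function names are hypothetical.

```python
import math


def self_attention(tokens):
    """Single-head self-attention with identity projections (Q = K = V)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens
        ]
        # Numerically stable softmax.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of the value vectors.
        out.append(
            [sum(w / total * v[i] for w, v in zip(weights, tokens))
             for i in range(dim)]
        )
    return out


def frame_attention(frames):
    """Local step: tokens attend only to tokens of the same frame."""
    return [self_attention(frame) for frame in frames]


def global_attention(frames):
    """Global step: tokens of all frames attend to each other jointly."""
    flat = [tok for frame in frames for tok in frame]
    mixed = self_attention(flat)
    # Split the flat token list back into its per-frame groups.
    out, start = [], 0
    for frame in frames:
        out.append(mixed[start:start + len(frame)])
        start += len(frame)
    return out


def alternating_attention(frames, num_blocks=2):
    """Alternate local and global steps num_blocks times."""
    for _ in range(num_blocks):
        frames = frame_attention(frames)
        frames = global_attention(frames)
    return frames


# Usage: two frames with two 2-dimensional tokens each.
frames = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
result = alternating_attention(frames)
```

The output keeps the per-frame grouping of the input, so the same function handles one view or many; only the global step lets information flow between views.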

Availability and Outlook

The code and models for VGGT are publicly available, which supports further research and development in this area. The authors of the paper see VGGT as an important step towards more comprehensive and efficient 3D computer vision. Future work could focus on improving the accuracy and speed of the network and on broadening its range of applications. Specialized versions for specific fields, such as medical imaging or 3D modeling, are also conceivable.

Bibliography:
Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., & Novotny, D. (2025). VGGT: Visual Geometry Grounded Transformer. *arXiv preprint arXiv:2503.11651*. https://arxiv.org/abs/2503.11651
https://github.com/facebookresearch/vggt
https://huggingface.co/papers/2503.11651
https://arxiv.org/html/2503.11651v1
https://vgg-t.github.io/
https://jytime.github.io/
https://x.com/vfx_ai?lang=de
https://huggingface.co/spaces/facebook/vggt
https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_VGGSfM_Visual_Geometry_Grounded_Deep_Structure_From_Motion_CVPR_2024_paper.pdf