MVGD: Novel Multi-View Diffusion Model for 3D Scene Reconstruction

New Possibilities for 3D Scene Reconstruction: Direct Pixel Generation with MVGD
Reconstructing three-dimensional scenes from a limited number of images with known camera poses is a central challenge in computer vision. Existing methods typically rely on intermediate 3D representations such as neural radiance fields, voxel grids, or 3D Gaussian splats to keep appearance and geometry consistent across viewpoints. A new method called MVGD (Multi-View Geometric Diffusion) takes a different approach: it uses a diffusion model to generate images and depth maps for novel viewpoints directly at the pixel level, without building an explicit intermediate 3D representation.
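To make the underlying mechanism concrete, the following is a minimal sketch of the kind of conditional diffusion sampling loop such a model builds on. It is not MVGD's implementation: the denoiser `eps_model`, the linear noise schedule, and the `cond` bundle are placeholders standing in for the paper's actual network and conditioning.

```python
# A minimal DDPM-style ancestral sampling loop (illustrative only; MVGD's
# actual network, schedule, and conditioning differ and are not shown here).
import torch

def ddpm_sample(eps_model, cond, shape, n_steps=1000, device="cpu"):
    """Iteratively denoise Gaussian noise into a sample.

    eps_model(x_t, t, cond) -> predicted noise with the same shape as x_t.
    `cond` bundles the conditioning, e.g. encoded input views and the
    target camera (placeholder here).
    """
    betas = torch.linspace(1e-4, 0.02, n_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)  # start from pure noise
    for t in reversed(range(n_steps)):
        eps = eps_model(x, torch.tensor([t], device=device), cond)
        # Posterior mean of x_{t-1}, given the predicted noise.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

# Toy usage with a placeholder denoiser that ignores its conditioning:
eps_model = lambda x, t, cond: torch.zeros_like(x)
sample = ddpm_sample(eps_model, cond=None, shape=(1, 3, 64, 64), n_steps=50)
```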
MVGD generates images and depth maps directly at the pixel level, conditioned on an arbitrary number of input views. At the heart of the method are so-called raymaps, which augment the visual features of each view with the camera rays belonging to its pixels. This spatial conditioning both anchors the input views geometrically and steers generation toward the target viewpoint. Another key aspect is MVGD's multi-task capability: images and depth maps are generated jointly, with learnable task embeddings guiding the diffusion process toward the respective modality.
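The paper's exact parameterization is not reproduced here; the sketch below shows one common raymap encoding, in which every pixel is tagged with its world-space camera ray (origin plus unit direction) computed from an assumed intrinsics matrix `K` and a 4x4 camera-to-world pose. The task-embedding line at the end is likewise purely illustrative.

```python
# Hypothetical raymap construction: each pixel is tagged with its world-space
# camera ray (origin + unit direction). MVGD's exact encoding may differ;
# `K` (3x3 intrinsics) and `cam_to_world` (4x4 pose) are assumed inputs.
import torch

def make_raymap(K, cam_to_world, H, W):
    """Return an (H, W, 6) tensor of per-pixel ray origins and directions."""
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    dirs_cam = pix @ torch.linalg.inv(K).T           # unproject pixel centers
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T   # rotate into world frame
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origins = cam_to_world[:3, 3].expand(H, W, 3)    # camera center per pixel
    return torch.cat([origins, dirs_world], dim=-1)

K = torch.tensor([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
raymap = make_raymap(K, torch.eye(4), H=64, W=64)    # shape (64, 64, 6)

# Illustrative learnable task embeddings, one per output modality,
# that a diffusion model could be conditioned on (0: RGB, 1: depth).
task_embed = torch.nn.Embedding(num_embeddings=2, embedding_dim=256)
```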
MVGD was trained on more than 60 million multi-view samples drawn from publicly available datasets, and dedicated techniques were developed to train efficiently and consistently on such heterogeneous data. In addition, an incremental fine-tuning strategy allows larger models to be initialized from smaller, already-trained ones, which shows promising scaling behavior.
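As one way to picture such incremental fine-tuning, the sketch below seeds a deeper transformer stack with copies of a trained shallower one before training continues end-to-end. The helper `grow_depth` and the cycling initialization are assumptions for illustration, not the authors' exact scheme.

```python
# Hedged sketch of "growing" a trained small model into a larger one:
# the deeper stack is initialized from the shallow stack, then fine-tuned.
import copy
import torch.nn as nn

def grow_depth(small_blocks: nn.ModuleList, target_depth: int) -> nn.ModuleList:
    """Initialize a deeper block stack by cycling copies of trained blocks."""
    return nn.ModuleList(
        [copy.deepcopy(small_blocks[i % len(small_blocks)])
         for i in range(target_depth)]
    )

# Example: seed a 6-layer encoder from a trained 3-layer one.
small = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=256, nhead=8) for _ in range(3)]
)
large = grow_depth(small, target_depth=6)  # then fine-tune `large` end-to-end
```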
Convincing Results and Versatile Applications
Comprehensive experiments demonstrate MVGD's performance: the method achieves state-of-the-art results on several novel view synthesis benchmarks and also delivers promising results in multi-view stereo and video depth estimation. Direct pixel generation yields detailed, realistic renderings of scenes from new viewpoints, which is relevant for numerous applications.
The development of MVGD opens up new perspectives for 3D scene reconstruction. The combination of diffusion models with raymap conditioning and multi-task generation enables efficient and accurate representation of scenes from arbitrary viewpoints. The promising benchmark results highlight the potential of MVGD for future applications in areas such as virtual reality, robotics, and autonomous driving.
For companies like Mindverse, which specialize in AI-powered content creation and customized solutions, methods like MVGD offer exciting possibilities. The technology could be integrated, for example, into the development of realistic virtual environments, interactive 3D models, or advanced image editing tools. The ability to reconstruct detailed 3D scenes from a few images opens up new avenues for content creation and analysis.
Bibliography:
- https://arxiv.org/abs/2501.18804
- https://arxiv.org/html/2501.18804v1
- https://mvgd.github.io/
- https://x.com/zhenjun_zhao/status/1886271208158855413
- https://openreview.net/pdf/602ea861b5b36b0a3dcfc719358d1cd004d5ca88.pdf
- https://openreview.net/forum?id=zDJf7fvdid
- https://openaccess.thecvf.com/content/CVPR2023/papers/Deng_NeRDi_Single-View_NeRF_Synthesis_With_Language-Guided_Diffusion_As_General_Image_CVPR_2023_paper.pdf
- https://jmhb0.github.io/view_neti/
- https://www.researchgate.net/publication/385749761_Novel_View_Synthesis_with_Pixel-Space_Diffusion_Models
- https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/00150.pdf