We present Gen3R, a method that bridges the strong priors of foundation reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens; these latents are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and the corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Moreover, our method can improve the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
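To make the adapter-and-alignment idea concrete, the following is a minimal sketch, not the authors' implementation: a hypothetical adapter maps VGGT-style reconstruction tokens to geometric latents, and a simple regularizer pulls them toward the appearance latents of a pre-trained video diffusion VAE. All module names, tensor shapes, and the MSE-plus-cosine loss are illustrative assumptions; the paper's exact regularizer and latent dimensions may differ.

```python
# Hypothetical sketch of a geometric-latent adapter with appearance-alignment regularization.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometricLatentAdapter(nn.Module):
    """Projects frozen reconstruction tokens into a latent space matched to diffusion latents."""

    def __init__(self, token_dim: int = 1024, latent_dim: int = 16, num_layers: int = 2):
        super().__init__()
        blocks, dim = [], token_dim
        for _ in range(num_layers - 1):
            blocks += [nn.Linear(dim, dim), nn.GELU()]
        blocks.append(nn.Linear(dim, latent_dim))
        self.proj = nn.Sequential(*blocks)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, token_dim) -> geometric latents: (batch, num_tokens, latent_dim)
        return self.proj(tokens)


def alignment_loss(geo_latents: torch.Tensor, app_latents: torch.Tensor) -> torch.Tensor:
    """Encourage geometric latents to align with appearance latents of matching shape (assumed)."""
    mse = F.mse_loss(geo_latents, app_latents)
    cos = 1.0 - F.cosine_similarity(geo_latents, app_latents, dim=-1).mean()
    return mse + 0.1 * cos  # weighting is a placeholder choice


if __name__ == "__main__":
    adapter = GeometricLatentAdapter()
    vggt_tokens = torch.randn(2, 256, 1024)       # placeholder for frozen VGGT tokens
    appearance_latents = torch.randn(2, 256, 16)  # placeholder for video-diffusion VAE latents
    geometric_latents = adapter(vggt_tokens)
    loss = alignment_loss(geometric_latents, appearance_latents)
    loss.backward()
```

Under this reading, only the adapter is trained while the reconstruction backbone and the diffusion VAE stay frozen, so the alignment term is what ties the two latent spaces together for joint generation.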