Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from too few captured views remains an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency is difficult to preserve in video frames generated directly from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by this condition, the video diffusion model then synthesizes video frames that are both detail-preserving and highly 3D-consistent, ensuring the coherence of the scene across perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.
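The final recovery stage uses a confidence-aware 3D Gaussian Splatting optimization. As a minimal sketch, assuming "confidence-aware" means down-weighting pixels of the generated frames that are likely 3D-inconsistent in the photometric loss (the function name, weighting form, and toy data below are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def confidence_weighted_loss(rendered, target, confidence):
    """Per-pixel L1 photometric loss weighted by a confidence map.

    `confidence` holds values in [0, 1]; low values suppress the loss at
    pixels where the generated video frames are deemed unreliable, so the
    Gaussian Splatting optimization is driven mainly by trustworthy regions.
    This weighting scheme is an illustrative assumption.
    """
    return float(np.mean(confidence * np.abs(rendered - target)))

# Toy example: two 2x2 "frames"; the confidence map fully trusts the
# left column and ignores the right column entirely.
rendered = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[0.0, 2.0], [1.0, 4.0]])
confidence = np.array([[1.0, 0.0], [1.0, 0.0]])

loss = confidence_weighted_loss(rendered, target, confidence)
# Only the left-column errors (1 and 2) contribute: mean of (1, 0, 2, 0) = 0.75
```

In an actual 3DGS pipeline, `rendered` would come from the differentiable rasterizer at each iteration, and the confidence map could be derived from cross-frame consistency of the diffusion outputs.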