Single-image 3D reconstruction remains a fundamental challenge in computer vision due to inherent geometric ambiguities and limited viewpoint information. Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data. However, leveraging these priors effectively faces three key challenges: (1) quality degradation under large camera motions, (2) difficulty in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. We address these challenges with LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency. Specifically, we design an articulated trajectory strategy for frame generation, which decomposes video sequences with large camera motions into ones with controllable small motions. We then use a robust neural matching model, i.e., MASt3R, to calibrate the camera poses of the generated frames and to produce the corresponding point clouds. Finally, we propose a distortion-aware 3D Gaussian splatting representation that learns independent per-frame distortions and outputs undistorted canonical Gaussians. Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on three challenging datasets, i.e., LLFF, DL3DV, and Tanks and Temples, and generalizes well to diverse in-the-wild inputs, from cartoon illustrations to complex real-world scenes.