Reconstructing the 3D shape of an object from a single RGB image is a long-standing and highly challenging problem in computer vision. In this paper, we propose a novel method for single-image 3D reconstruction that generates a sparse point cloud via a conditional denoising diffusion process. Our method takes as input a single RGB image along with its camera pose and gradually denoises a set of 3D points, whose positions are initially sampled at random from a three-dimensional Gaussian distribution, into the shape of an object. The key to our method is a geometrically consistent conditioning process that we call projection conditioning: at each step of the diffusion process, we project local image features onto the partially denoised point cloud from the given camera pose. Projection conditioning enables us to generate high-resolution sparse geometries that are well aligned with the input image, and it can additionally be used to predict point colors after shape reconstruction. Moreover, because the diffusion process is probabilistic, our method can naturally generate multiple different shapes consistent with a single input image. In contrast to prior work, our approach not only performs well on synthetic benchmarks but also yields large qualitative improvements on complex real-world data.
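To make the projection conditioning step concrete, the following is a minimal sketch rather than the actual implementation. It assumes a pinhole camera with rotation `R`, translation `t`, and a single focal length, a nearest-pixel feature lookup, and no occlusion handling. The names `project_point`, `projection_condition`, `sample_shape`, and the `denoise_step` callable are hypothetical; `denoise_step` stands in for the learned denoiser and its update rule, which the abstract does not specify.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def project_point(p: Sequence[float], R: Sequence[Sequence[float]],
                  t: Sequence[float], focal: float,
                  cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a world-space point to pixel coordinates (u, v).

    Returns None if the point lies behind (or on) the camera plane.
    """
    # Camera-space coordinates: x_cam = R @ p + t
    xc = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 1e-8:
        return None
    return focal * xc[0] / xc[2] + cx, focal * xc[1] / xc[2] + cy


def projection_condition(points, feature_map, R, t, focal) -> List[List[float]]:
    """Attach to each 3D point the image feature at its projected pixel.

    feature_map is an H x W x C nested list. Points that project outside
    the image or lie behind the camera receive a zero feature vector
    (an assumption of this sketch).
    """
    H, W, C = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    cx, cy = (W - 1) / 2.0, (H - 1) / 2.0  # assumed principal point: image center
    conditioned = []
    for p in points:
        feat = [0.0] * C
        uv = project_point(p, R, t, focal, cx, cy)
        if uv is not None:
            col, row = round(uv[0]), round(uv[1])  # nearest-pixel lookup
            if 0 <= row < H and 0 <= col < W:
                feat = list(feature_map[row][col])
        conditioned.append(list(p) + feat)
    return conditioned


def sample_shape(num_points: int, num_steps: int, feature_map, R, t, focal,
                 denoise_step: Callable[[List[List[float]], int], List[List[float]]],
                 seed: int = 0) -> List[List[float]]:
    """Start from Gaussian noise and re-project image features at every step."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(num_points)]
    for step in reversed(range(num_steps)):
        # Features are re-sampled each step because the points have moved.
        conditioned = projection_condition(points, feature_map, R, t, focal)
        # Hypothetical learned denoiser: maps [x, y, z, features...] rows
        # to updated [x, y, z] positions.
        points = denoise_step(conditioned, step)
    return points
```

Because the features are looked up again at every step, each point is conditioned on the image region it currently projects to. Under the assumptions above, this is what would keep the generated geometry aligned with the input view. Calling `sample_shape` with different seeds illustrates how several distinct shapes could be drawn for the same image.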