Reconstructing the 3D shape of an object from a single RGB image is a long-standing and highly challenging problem in computer vision. In this paper, we propose a novel method for single-image 3D reconstruction that generates a sparse point cloud via a conditional denoising diffusion process. Our method takes as input a single RGB image along with its camera pose and gradually denoises a set of 3D points, whose positions are initially sampled from a three-dimensional Gaussian distribution, into the shape of an object. The key to our method is a geometrically consistent conditioning process which we call projection conditioning: at each step in the diffusion process, we project local image features onto the partially denoised point cloud from the given camera pose. This projection conditioning process enables us to generate high-resolution sparse geometries that are well aligned with the input image, and can additionally be used to predict point colors after shape reconstruction. Moreover, due to the probabilistic nature of the diffusion process, our method is naturally capable of generating multiple distinct shapes consistent with a single input image. In contrast to prior work, our approach not only performs well on synthetic benchmarks, but also gives large qualitative improvements on complex real-world data.
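To make the projection conditioning step concrete, the following is a minimal sketch rather than our implementation. It assumes a pinhole camera with the principal point at the image centre, a nearest-pixel feature lookup, and no occlusion handling. The names `project_features`, `sample_points`, and `denoise_step` are hypothetical; `denoise_step` stands in for the learned denoiser together with its sampler update.

```python
import random


def project_features(points, feature_map, focal, R, t):
    """Attach to each 3D point the image feature at the pixel it projects to.

    points:      list of (x, y, z) in world coordinates
    feature_map: H x W grid of feature vectors (nested lists)
    focal:       focal length in pixels (assumed; principal point at image centre)
    R, t:        world-to-camera rotation (3x3 nested list) and translation
    """
    height, width = len(feature_map), len(feature_map[0])
    dim = len(feature_map[0][0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    out = []
    for p in points:
        # World -> camera coordinates: X_cam = R @ X_world + t
        xc, yc, zc = (
            sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)
        )
        feat = [0.0] * dim  # default for points behind the camera or off-image
        if zc > 1e-8:
            # Pinhole projection followed by a nearest-pixel lookup
            u = int(round(focal * xc / zc + cx))
            v = int(round(focal * yc / zc + cy))
            if 0 <= u < width and 0 <= v < height:
                feat = list(feature_map[v][u])
        out.append(feat)
    return out


def sample_points(feature_map, focal, R, t, denoise_step,
                  num_points=8, num_steps=5, seed=0):
    """Reverse-diffusion loop that re-projects image features at every step."""
    rng = random.Random(seed)
    # Start from points drawn from a standard 3D Gaussian
    points = [tuple(rng.gauss(0.0, 1.0) for _ in range(3))
              for _ in range(num_points)]
    for step in reversed(range(num_steps)):
        # Features are recomputed because the points move at every step
        feats = project_features(points, feature_map, focal, R, t)
        # Hypothetical callable: (points, features, step) -> updated points
        points = denoise_step(points, feats, step)
    return points
```

The point of the sketch is that the conditioning signal is recomputed from the current point positions at every step, so each point always sees the image features at the pixel it currently projects to, instead of a single global image embedding.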