We present a novel diffusion-based approach for coherent 3D scene reconstruction from a single RGB image. Our method uses an image-conditioned 3D scene diffusion model to jointly denoise the 3D poses and geometries of all objects in the scene. Because the task is ill-posed, and to obtain consistent reconstructions, we learn a generative scene prior: the model is conditioned on all scene objects simultaneously to capture scene context, and it learns inter-object relationships throughout the diffusion process. We further propose an efficient surface alignment loss that facilitates training even without full ground-truth annotations, which are often missing from publicly available datasets. This loss leverages an expressive shape representation that allows direct point sampling from intermediate shape predictions. By framing single-image 3D scene reconstruction as a conditional diffusion process, our approach surpasses current state-of-the-art methods, achieving a 12.04% improvement in AP3D on SUN RGB-D and a 13.43% increase in F-Score on Pix3D.
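To make the joint denoising idea concrete, the following is a minimal sketch of an image-conditioned scene denoiser together with a standard DDPM reverse step, written in PyTorch. The class name `SceneDenoiser`, the tensor layout (one pose-plus-shape code per object), and the transformer backbone are illustrative assumptions rather than the paper's actual architecture; attention over the object tokens is one plausible way to let the model capture inter-object relationships during denoising.

```python
# A minimal sketch, assuming a DDPM-style model over per-object pose+shape codes.
# All names, dimensions, and the schedule are illustrative, not the authors' code.
import torch
import torch.nn as nn

class SceneDenoiser(nn.Module):
    """Predicts noise jointly for all objects; self-attention across object
    tokens lets the model exploit inter-object relationships."""
    def __init__(self, obj_dim=256, img_dim=512, n_layers=4, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(obj_dim + img_dim + 1, obj_dim)
        layer = nn.TransformerEncoderLayer(obj_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(obj_dim, obj_dim)

    def forward(self, x_t, t, img_feat):
        # x_t: (B, O, obj_dim) noisy per-object pose+shape codes
        # t:   (B,) diffusion timestep; img_feat: (B, img_dim) image condition
        B, O, _ = x_t.shape
        cond = img_feat[:, None, :].expand(B, O, img_feat.shape[-1])
        tt = t.float()[:, None, None].expand(B, O, 1)
        h = self.in_proj(torch.cat([x_t, cond, tt], dim=-1))
        return self.out_proj(self.backbone(h))  # predicted noise, (B, O, obj_dim)

@torch.no_grad()
def ddpm_step(model, x_t, t, img_feat, betas):
    """One standard DDPM reverse step x_t -> x_{t-1}; t is a Python int."""
    beta = betas[t]
    alpha = 1.0 - beta
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = model(x_t, torch.full((x_t.shape[0],), t, device=x_t.device), img_feat)
    mean = (x_t - beta / torch.sqrt(1.0 - alpha_bar) * eps) / torch.sqrt(alpha)
    if t > 0:
        mean = mean + torch.sqrt(beta) * torch.randn_like(x_t)  # sigma_t^2 = beta_t
    return mean

# Usage sketch:
# model = SceneDenoiser()
# x = torch.randn(2, 8, 256)                # 8 objects per scene
# img = torch.randn(2, 512)                 # pooled image features
# betas = torch.linspace(1e-4, 0.02, 1000)
# for t in reversed(range(1000)):
#     x = ddpm_step(model, x, t, img, betas)
```

During training such a model would be supervised with the usual noise-prediction objective; at inference the reverse step is applied from t = T-1 down to 0, denoising all object poses and shape codes jointly.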
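The surface alignment loss can be sketched in the same spirit. The snippet below assumes that points have already been sampled from each object's intermediate shape prediction (the expressive shape representation is what makes this sampling direct), computes a one-sided Chamfer-style distance from partial ground-truth surface samples to the predicted points, and uses a per-object mask to drop objects that lack annotations, matching the partially annotated public datasets mentioned above. The function name, tensor shapes, and the one-sided formulation are illustrative assumptions.

```python
# A minimal sketch of a surface alignment term under the assumptions stated
# in the lead-in; not the paper's implementation.
import torch

def surface_alignment_loss(pred_points: torch.Tensor,
                           gt_points: torch.Tensor,
                           gt_mask: torch.Tensor) -> torch.Tensor:
    """One-sided Chamfer distance from partial GT surfaces to predicted surfaces.

    pred_points: (B, O, Np, 3) points sampled from intermediate shape predictions
    gt_points:   (B, O, Ng, 3) partial ground-truth surface samples per object
    gt_mask:     (B, O) 1.0 where an object has surface annotations, 0.0 otherwise
    """
    B, O, Ng, _ = gt_points.shape
    Np = pred_points.shape[2]
    # Pairwise distances between GT and predicted points, per object: (B*O, Ng, Np)
    d = torch.cdist(gt_points.reshape(B * O, Ng, 3),
                    pred_points.reshape(B * O, Np, 3))
    # For each GT point, distance to its nearest predicted point.
    nearest = d.min(dim=-1).values             # (B*O, Ng)
    per_obj = nearest.mean(dim=-1).view(B, O)  # mean over GT samples
    mask = gt_mask.float()
    # Average only over annotated objects, so missing labels contribute nothing.
    return (per_obj * mask).sum() / mask.sum().clamp(min=1.0)

# Usage sketch:
# loss = surface_alignment_loss(torch.rand(2, 5, 1024, 3),
#                               torch.rand(2, 5, 512, 3),
#                               torch.ones(2, 5))
```

The one-sided direction (ground truth to prediction) only penalizes predicted surfaces for failing to cover the observed geometry, which is a natural choice when the ground-truth scans are themselves incomplete.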