Diffusion models that generate images conditioned on text, such as Dall-E 2 and Stable Diffusion, have recently made a splash far beyond the computer vision community. Here, we tackle the related problem of generating point clouds, both unconditionally and conditioned on images. For the latter, we introduce a novel, geometrically motivated conditioning scheme that projects sparse image features into the point cloud and attaches them to each individual point at every step of the denoising process. This approach improves geometric consistency and yields greater fidelity than current methods that rely on unstructured, global latent codes. Additionally, we show how to apply recent continuous-time diffusion schemes. Our method performs on par with or above the state of the art in conditional and unconditional experiments on synthetic data, while being faster, lighter, and delivering tractable likelihoods. We show that it can also scale to diverse indoor scenes.
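The projection-based conditioning described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with known intrinsics `K` and pose `R`, `t`, a dense feature map in place of sparse features, nearest-pixel lookup, and no occlusion handling. The function name `attach_image_features` and all parameter names are hypothetical.

```python
import numpy as np

def attach_image_features(points, features, K, R, t):
    """Attach image features to 3D points via pinhole projection (illustrative only).

    points:   (N, 3) point coordinates in the world frame
    features: (H, W, C) image feature map (assumed dense here)
    K:        (3, 3) camera intrinsics, in feature-map pixel units
    R, t:     (3, 3) rotation and (3,) translation, world to camera
    Returns an (N, 3 + C) array holding [x, y, z, features] for each point.
    """
    H, W, C = features.shape
    cam = points @ R.T + t                      # world frame -> camera frame
    z = cam[:, 2]
    in_front = z > 1e-6                         # points behind the camera get no feature
    safe_z = np.where(in_front, z, 1.0)         # avoid dividing by zero or negative depth
    uv = (cam @ K.T)[:, :2] / safe_z[:, None]   # perspective divide -> pixel coordinates
    cols = np.rint(uv[:, 0]).astype(int)        # nearest pixel column (u)
    rows = np.rint(uv[:, 1]).astype(int)        # nearest pixel row (v)
    visible = in_front & (cols >= 0) & (cols < W) & (rows >= 0) & (rows < H)
    # Points that project outside the image keep an all-zero feature vector.
    point_feats = np.zeros((points.shape[0], C), dtype=features.dtype)
    point_feats[visible] = features[rows[visible], cols[visible]]
    return np.concatenate([points, point_feats], axis=1)
```

In a denoising loop, a function like this would be called at every step on the current noisy point positions, so that each point carries the image features of the pixel it currently projects to before being passed to the denoising network.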