Diffusion models that generate images conditioned on text, such as Dall-E 2 and Stable Diffusion, have recently made a splash far beyond the computer vision community. Here, we tackle the related problem of generating point clouds, both unconditionally and conditioned on images. For the latter, we introduce a novel geometrically-motivated conditioning scheme based on projecting sparse image features into the point cloud and attaching them to each individual point, at every step in the denoising process. This approach improves geometric consistency and yields greater fidelity than current methods relying on unstructured, global latent codes. Additionally, we show how to apply recent continuous-time diffusion schemes. Our method performs on par with or above the state of the art on conditional and unconditional experiments on synthetic data, while being faster, lighter, and delivering tractable likelihoods. We show it can also scale to diverse indoor scenes.
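To make the projection-based conditioning concrete, the following is a minimal NumPy sketch of one way to project points into an image feature map and attach the sampled feature to each point. It is an illustration under stated assumptions, not the implementation used in this work: the function name `attach_image_features`, the pinhole camera model, the nearest-pixel lookup, and the zero-fill for points outside the view are all hypothetical choices made for the example.

```python
import numpy as np

def attach_image_features(points, features, K, R, t):
    """Append a per-point image feature to each 3D point (illustrative sketch).

    points:   (N, 3) points in world coordinates
    features: (H, W, C) image feature map (assumed to come from some 2D encoder)
    K:        (3, 3) pinhole intrinsics, expressed in feature-map pixels (assumption)
    R, t:     world-to-camera rotation (3, 3) and translation (3,)
    Returns an (N, 3 + C) array. Points behind the camera or outside the
    feature map receive a zero feature vector (an assumption of this sketch).
    """
    H, W, C = features.shape
    cam = points @ R.T + t                    # world frame -> camera frame
    z = cam[:, 2]
    in_front = z > 1e-6                       # only points in front of the camera
    safe_z = np.where(in_front, z, 1.0)       # avoid division by zero
    uvw = cam @ K.T                           # homogeneous pixel coordinates
    col = np.round(uvw[:, 0] / safe_z).astype(int)
    row = np.round(uvw[:, 1] / safe_z).astype(int)
    visible = in_front & (col >= 0) & (col < W) & (row >= 0) & (row < H)

    sampled = np.zeros((points.shape[0], C), dtype=features.dtype)
    sampled[visible] = features[row[visible], col[visible]]  # nearest-pixel lookup
    return np.concatenate([points, sampled], axis=1)

# Toy usage: random values stand in for real image features and noisy points.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 4))
K = np.array([[8.0, 0.0, 4.0],
              [0.0, 8.0, 4.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 2.0])
noisy_points = rng.normal(scale=0.2, size=(16, 3))
conditioned = attach_image_features(noisy_points, features, K, R, t)
print(conditioned.shape)  # (16, 7)
```

In a denoising loop, a function like this would be called again at every step, since the point positions change as noise is removed and each point therefore lands on a different image location.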