In recent years, Denoising Diffusion Probabilistic Models (DDPMs) have demonstrated exceptional performance in various 2D generative tasks. Following this success, DDPMs have been extended to 3D shape generation, surpassing previous methods in this domain. While many of these models are unconditional, some have explored guidance from other modalities. In particular, image guidance for 3D generation has been explored through CLIP embeddings. However, these embeddings are designed to align images and text, and do not necessarily capture the specific details needed for shape generation. To address this limitation and give image-guided 3D DDPMs a stronger 3D understanding, we introduce CISP (Contrastive Image-Shape Pre-training), which yields a well-structured image-shape joint embedding space. Building upon CISP, we then introduce IC3D, a DDPM that harnesses CISP's guidance for 3D shape generation from single-view images. This generative diffusion model outperforms existing baselines in both quality and diversity of generated 3D shapes. Moreover, despite IC3D's generative nature, its generated shapes are preferred by human evaluators over those of a competitive single-view 3D reconstruction model. These properties contribute to a coherent embedding space, which enables latent interpolation and conditioned generation even from out-of-distribution images. We also find that IC3D generates coherent and diverse completions when presented with occluded views, making it applicable in controlled real-world scenarios.
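As an illustration of what a contrastive image-shape objective can look like, the following minimal sketch computes a CLIP-style symmetric cross-entropy loss over a batch of paired image and shape embeddings. It is not taken from the paper: the function names, the pure-Python list representation, and the temperature of 0.07 are assumptions made for the example, and the exact CISP loss and encoders may differ.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_emb, shape_emb, temperature=0.07):
    """CLIP-style loss: the i-th image and i-th shape form the positive pair.

    The temperature value is an assumed hyperparameter, not one from the paper.
    """
    imgs = [normalize(v) for v in image_emb]
    shps = [normalize(v) for v in shape_emb]
    # Cosine similarity between every image and every shape, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, s)) / temperature for s in shps]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # shape-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

With this sketch, a batch in which each image embedding matches its paired shape embedding gives a loss near zero, while a batch with mismatched pairs gives a large loss. That is the pressure that pulls paired images and shapes together in a joint embedding space.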