Recent advancements in deep generative models, particularly the application of CLIP (Contrastive Language-Image Pre-training) to Denoising Diffusion Probabilistic Models (DDPMs), have demonstrated remarkable effectiveness in text-to-image generation. The well-structured embedding space of CLIP has also been extended to image-to-shape generation with DDPMs, yielding notable results. Despite these successes, some fundamental questions arise: Does CLIP ensure the best results in shape generation from images? Can we leverage conditioning to bring explicit 3D knowledge into the generative process and obtain better quality? This study introduces CISP (Contrastive Image-Shape Pre-training), designed to enhance 3D shape synthesis guided by 2D images. CISP aims to enrich the CLIP framework by aligning 2D images with 3D shapes in a shared embedding space, specifically capturing 3D characteristics potentially overlooked by CLIP's text-image focus. Our comprehensive analysis assesses CISP's guidance performance against CLIP-guided models, focusing on generation quality, diversity, and coherence of the produced shapes with the conditioning image. We find that, while matching CLIP in generation quality and diversity, CISP substantially improves coherence with input images, underscoring the value of incorporating 3D knowledge into generative models. These findings suggest a promising direction for advancing the synthesis of 3D visual content by integrating multimodal systems with 3D representations.
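The alignment of 2D images and 3D shapes in a shared embedding space can be illustrated with a generic CLIP-style symmetric contrastive objective. The sketch below is an assumption for illustration only: the abstract does not specify CISP's loss, encoders, or hyperparameters, so the function name `symmetric_contrastive_loss`, the temperature of 0.07, and the use of precomputed embedding arrays are all hypothetical choices, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, shape_emb, temperature=0.07):
    """Generic CLIP-style loss over a batch of paired embeddings.

    image_emb, shape_emb: arrays of shape (batch, dim); row i of each is
    assumed to come from the same object (hypothetical image and shape
    encoders are not shown). The temperature value is an assumption.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    shp = l2_normalize(np.asarray(shape_emb, dtype=np.float64))

    # Pairwise cosine similarities, scaled by the temperature.
    logits = img @ shp.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-shape: each image should pick out its own shape (row-wise).
    loss_i2s = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Shape-to-image: each shape should pick out its own image (column-wise).
    loss_s2i = -log_softmax(logits, axis=0)[idx, idx].mean()

    return 0.5 * (loss_i2s + loss_s2i)
```

Under these assumptions, the loss is small when matching image and shape embeddings lie close together and large when the pairing is scrambled, which is the property a contrastively trained embedding space would then offer as a conditioning signal to a diffusion model.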