We are witnessing rapid progress in automatically generating and manipulating 3D assets due to the availability of pretrained text-image diffusion models. However, synthesizing each sample requires a time-consuming optimization procedure, which hinders the potential of these methods for democratizing 3D content creation. Conversely, 3D diffusion models now train on million-scale 3D datasets, yielding high-quality text-conditional 3D samples within seconds. In this work, we present SPiC-E, a neural network that adds structural guidance to 3D diffusion models, extending their usage beyond text-conditional generation. At its core, our framework introduces a cross-entity attention mechanism that allows multiple entities (in particular, paired input and guidance 3D shapes) to interact via their internal representations within the denoising network. We use this mechanism to learn task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing, and text-conditional abstraction-to-3D, which transforms primitive-based abstractions into highly expressive shapes. Extensive experiments demonstrate that SPiC-E achieves state-of-the-art performance on these tasks while often being considerably faster than alternative methods. Importantly, this is accomplished without tailoring our approach to any specific task.
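To make the idea of cross-entity attention concrete, the following is a minimal sketch in plain Python, not the SPiC-E layer itself. It assumes one simple form of interaction: the queries of each entity attend to the keys and values of the other entity, and the result is added back as a residual. The function names, the shared projection matrices `w_q`, `w_k`, `w_v`, and the toy token counts are all hypothetical choices made for illustration; the abstract does not specify how the actual network combines the two representations.

```python
import math
import random


def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    # Scaled dot-product attention; returns the outputs and the attention weights.
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values), weights


def add(a, b):
    # Element-wise sum, used for the residual connection.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def cross_entity_attention(x_input, x_guide, w_q, w_k, w_v):
    # Hypothetical layer (an assumption, not the published architecture):
    # each entity's queries attend to the OTHER entity's keys and values.
    q_in, k_in, v_in = (matmul(x_input, w) for w in (w_q, w_k, w_v))
    q_gd, k_gd, v_gd = (matmul(x_guide, w) for w in (w_q, w_k, w_v))
    out_in, attn_in = attend(q_in, k_gd, v_gd)  # input shape reads from guidance
    out_gd, attn_gd = attend(q_gd, k_in, v_in)  # guidance shape reads from input
    return add(x_input, out_in), add(x_guide, out_gd), attn_in, attn_gd


def random_matrix(rows, cols, rng, scale=1.0):
    # Gaussian-initialized matrix standing in for latent tokens or weights.
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, dim = 4, 8  # toy sizes chosen only for this example
x_input = random_matrix(n_tokens, dim, rng)
x_guide = random_matrix(n_tokens, dim, rng)
w_q, w_k, w_v = (random_matrix(dim, dim, rng, dim ** -0.5) for _ in range(3))
new_input, new_guide, attn_in, attn_gd = cross_entity_attention(
    x_input, x_guide, w_q, w_k, w_v
)
```

In this sketch both entities keep their own token sequences and exchange information only through the attention weights, which is one plausible reading of entities interacting "via their internal representations". A trained model would presumably use learned projections inside each denoising block rather than random matrices.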