We are witnessing rapid progress in automatically generating and manipulating 3D assets, driven by the availability of pretrained text-image diffusion models. However, these methods require a time-consuming optimization procedure to synthesize each sample, which limits their potential for democratizing 3D content creation. Conversely, 3D diffusion models now train on million-scale 3D datasets, yielding high-quality text-conditional 3D samples within seconds. In this work, we present SPiC-E, a neural network that adds structural guidance to 3D diffusion models, extending their usage beyond text-conditional generation. At its core, our framework introduces a cross-entity attention mechanism that allows multiple entities (in particular, paired input and guidance 3D shapes) to interact via their internal representations within the denoising network. We use this mechanism to learn task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing, and text-conditional abstraction-to-3D, which transforms primitive-based abstractions into highly expressive shapes. Extensive experiments demonstrate that SPiC-E achieves state-of-the-art performance on these tasks while often being considerably faster than alternative methods. Importantly, this is accomplished without tailoring our approach to any specific task.
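To make the idea of entities interacting through their internal representations more concrete, here is a minimal sketch of one possible form of cross-entity attention. It is not taken from the paper: the abstract does not specify the formulation, so this sketch assumes that each entity's queries attend over the concatenated keys and values of both entities. The names `cross_entity_attention`, `x_input`, `x_guide`, `w_q`, `w_k`, and `w_v` are hypothetical, the weights are random, and plain Python lists are used only so the example runs without dependencies.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entity_attention(x_input, x_guide, w_q, w_k, w_v):
    """Assumed design: queries of each entity attend over tokens of BOTH entities.

    x_input, x_guide: token matrices (tokens x d_model) for the input shape
    and the guidance shape. Returns one updated token matrix per entity.
    """
    d_k = len(w_q[0])
    joint = x_input + x_guide              # stack the tokens of both entities
    keys = matmul(joint, w_k)              # (n + m) x d_k
    values = matmul(joint, w_v)            # (n + m) x d_v
    keys_t = [list(col) for col in zip(*keys)]

    outputs = []
    for tokens in (x_input, x_guide):
        queries = matmul(tokens, w_q)      # tokens x d_k
        scores = matmul(queries, keys_t)   # tokens x (n + m)
        weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
        outputs.append(matmul(weights, values))
    return outputs[0], outputs[1]


# Toy usage with random stand-ins for latent tokens and learned projections.
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


d_model, d_k = 4, 3
x_input = rand_matrix(5, d_model)          # 5 latent tokens of the input shape
x_guide = rand_matrix(2, d_model)          # 2 latent tokens of the guidance shape
w_q = rand_matrix(d_model, d_k)
w_k = rand_matrix(d_model, d_k)
w_v = rand_matrix(d_model, d_model)

out_input, out_guide = cross_entity_attention(x_input, x_guide, w_q, w_k, w_v)
```

Under this assumption, the updated input tokens depend on the guidance tokens through the shared keys and values, which is one way a guidance shape could steer the denoising of the input shape. The actual mechanism in SPiC-E may combine the representations differently.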