We are witnessing rapid progress in automatically generating and manipulating 3D assets, thanks to the availability of pretrained text-image diffusion models. However, these methods require a time-consuming optimization procedure to synthesize each sample, which hinders their potential for democratizing 3D content creation. Conversely, 3D diffusion models now train on million-scale 3D datasets, yielding high-quality text-conditional 3D samples within seconds. In this work, we present Spice-E, a neural network that adds structural guidance to 3D diffusion models, extending their usage beyond text-conditional generation. At its core, our framework introduces a cross-entity attention mechanism that allows multiple entities (in particular, paired input and guidance 3D shapes) to interact via their internal representations within the denoising network. We utilize this mechanism to learn task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing, and text-conditional abstraction-to-3D, which transforms primitive-based abstractions into highly expressive shapes. Extensive experiments demonstrate that Spice-E achieves state-of-the-art performance on these tasks while often being considerably faster than alternative methods. Importantly, this is accomplished without tailoring our approach to any specific task.
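To make the cross-entity attention idea concrete, the following is a minimal sketch of one plausible form of it, not the Spice-E implementation. It assumes that each entity (the input shape and the guidance shape) is represented as a sequence of latent tokens, and that interaction happens by running a single attention pass over the concatenated tokens so that every token of one entity can attend to the tokens of the other. The function name `cross_entity_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all dimensions are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_entity_attention(x_input, x_guide, w_q, w_k, w_v):
    """Single-head attention over the tokens of two entities (illustrative only).

    x_input: (n, d) latent tokens of the input shape (assumed layout).
    x_guide: (m, d) latent tokens of the guidance shape (assumed layout).
    Returns the updated tokens of each entity and the attention weights.
    """
    # Concatenate both entities so queries from one can see keys of the other.
    tokens = np.concatenate([x_input, x_guide], axis=0)  # (n + m, d)
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    out = weights @ v
    n = x_input.shape[0]
    return out[:n], out[n:], weights


# Toy data: 4 input tokens and 6 guidance tokens of width 8 (arbitrary sizes).
rng = np.random.default_rng(0)
d = 8
x_input = rng.normal(size=(4, d))
x_guide = rng.normal(size=(6, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out_input, out_guide, weights = cross_entity_attention(
    x_input, x_guide, w_q, w_k, w_v
)
print(out_input.shape, out_guide.shape, weights.shape)
```

In this sketch the guidance shape influences the input shape only through the attention weights, so changing `x_guide` changes `out_input` even though `x_input` is fixed. How the actual model shares or separates projection weights between entities, and where in the denoising network the interaction occurs, is not specified in the abstract and is left open here.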