Recent progress in text-driven 3D stylization of single objects has been driven largely by CLIP-based methods. However, the stylization of multi-object 3D scenes remains difficult because the image-text pairs used to pre-train CLIP mostly depict a single object. Moreover, the local details of multiple objects are prone to omission because existing supervision relies primarily on the coarse-grained contrast of image-text pairs. To overcome these challenges, we present a novel framework, dubbed TeMO, that parses multi-object 3D scenes and edits their styles under contrastive supervision at multiple levels. We first propose a Decoupled Graph Attention (DGA) module to distinguishably reinforce the features of 3D surface points. In particular, a cross-modal graph is constructed to accurately align the object points and noun phrases decoupled from the 3D mesh and the textual description, respectively. Then, we develop a Cross-Grained Contrast (CGC) supervision system, in which a fine-grained loss between the words in the textual description and the randomly rendered images is constructed to complement the coarse-grained loss. Extensive experiments show that our method synthesizes high-quality stylized content and outperforms existing methods over a wide range of multi-object 3D meshes. Our code and results will be made publicly available.
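To make the idea of complementing a coarse-grained loss with a fine-grained one concrete, the following is a minimal sketch and not the TeMO implementation. It assumes that embeddings for rendered views, image patches, the full sentence, and individual words have already been produced by some encoder such as CLIP. The max-over-patches word matching, the function names, and the `weight` parameter are all illustrative assumptions; the abstract does not specify the actual loss formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def coarse_loss(view_embs, sentence_emb):
    """Coarse-grained term: one minus the mean similarity between each
    rendered view and the whole textual description."""
    sims = [cosine(view, sentence_emb) for view in view_embs]
    return 1.0 - sum(sims) / len(sims)


def fine_loss(patch_embs_per_view, word_embs):
    """Fine-grained term (assumed form): for every view and every word,
    keep the best-matching patch, then average over all pairs."""
    total, count = 0.0, 0
    for patches in patch_embs_per_view:
        for word in word_embs:
            total += max(cosine(patch, word) for patch in patches)
            count += 1
    return 1.0 - total / count


def cross_grained_loss(view_embs, sentence_emb, patch_embs_per_view,
                       word_embs, weight=0.5):
    """Sum of both terms; `weight` is a hypothetical balancing factor."""
    return (coarse_loss(view_embs, sentence_emb)
            + weight * fine_loss(patch_embs_per_view, word_embs))
```

In this sketch the coarse term only sees one embedding per view, so a word describing a small object can be ignored without much penalty. The fine term penalizes any word that has no well-matching patch in a view, which is one plausible way to keep local details from being dropped.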