In this paper, we propose a novel language-guided 3D arbitrary neural style transfer method (CLIP3Dstyler). We aim to stylize any 3D scene with an arbitrary style given by a text description and to synthesize novel stylized views, which is more flexible than image-conditioned style transfer. Compared with the previous 2D method CLIPStyler, we are able to stylize a 3D scene and generalize to novel scenes without retraining our model. A straightforward solution is to combine previous image-conditioned 3D style transfer and text-conditioned 2D style transfer methods. However, such a solution cannot achieve our goal due to two main challenges. First, there is no multi-modal model that matches point clouds and language at different feature scales (low-level, high-level). Second, we observe a style mixing issue when we stylize the content with different style conditions from text prompts. To address the first issue, we propose a 3D stylization framework that matches point cloud features with text features in local and global views. For the second issue, we propose an improved directional divergence loss that makes arbitrary text styles more distinguishable, as a complement to our framework. We conduct extensive experiments to show the effectiveness of our model on text-guided 3D scene style transfer.