Domain adaptation of 3D portraits has attracted growing attention. However, the transfer mechanisms of existing methods rely mainly on either vision or language, ignoring the potential of combined vision-language guidance. In this paper, we propose an image-text multi-modal framework, Image and Text portrait (ITportrait), for 3D portrait domain adaptation. ITportrait relies on a two-stage alternating training strategy. In the first stage, we employ a 3D Artistic Paired Transfer (APT) method for image-guided style transfer. APT constructs paired photo-realistic portraits to obtain accurate artistic poses, which helps ITportrait achieve high-quality 3D style transfer. In the second stage, we propose a 3D Image-Text Embedding (ITE) approach in the CLIP space. ITE uses a threshold function to adaptively control the optimization direction of images or texts in the CLIP space. Comprehensive experiments show that ITportrait achieves state-of-the-art (SOTA) results and benefits downstream tasks. All source code and pre-trained models will be released to the public.
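The threshold-based control that ITE applies in the CLIP space can be illustrated with a minimal sketch. The abstract does not specify the threshold function, so everything below is an assumption made for illustration: the function names `cosine` and `select_guidance`, the threshold `tau`, the rule "follow the image until the generated embedding is close enough to it, then follow the text", and the `1 - cosine similarity` loss. Embeddings are plain lists standing in for CLIP features; this is not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_guidance(gen_emb, image_emb, text_emb, tau=0.8):
    """Hypothetical threshold gate (an assumption, not the paper's rule).

    While the generated embedding is less similar to the image reference
    than tau, optimize toward the image; otherwise optimize toward the text.
    Returns the chosen modality and a 1 - cosine loss for it.
    """
    sim_image = cosine(gen_emb, image_emb)
    if sim_image < tau:
        return "image", 1.0 - sim_image
    return "text", 1.0 - cosine(gen_emb, text_emb)


# Toy 2-D vectors standing in for CLIP embeddings.
far_from_image = select_guidance([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
close_to_image = select_guidance([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

In this sketch the gate returns a single active loss per step, so only one modality drives the update at a time. A real system would compute the embeddings with a CLIP encoder and backpropagate the selected loss into the 3D generator.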