Emotional talking face generation has recently received considerable attention. However, existing methods accept only one-hot codes, images, or audio as emotion conditions. This limits flexible control in practical applications and, because these conditions carry limited semantics, prevents the methods from handling unseen emotion styles. They also neglect either the one-shot setting or the quality of the generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we additionally specify the emotion style through text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modalities into a unified space that inherits a rich semantic prior from CLIP. This multi-modal emotion space allows our method to accept an arbitrary emotion modality at test time and to generalize to unseen emotion styles. In addition, we propose an Emotion-aware Audio-to-3DMM Convertor that maps the emotion condition and the audio sequence to a structural representation. A subsequent style-based High-fidelity Emotional Face generator then synthesizes realistic, high-resolution faces for arbitrary identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and its effectiveness in high-quality face synthesis.
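As a minimal sketch of what a unified emotion space could look like, the toy example below projects text, image, and audio features through separate linear heads into one shared, L2-normalized space, so that an emotion condition from any modality can be compared or swapped at test time. It is an illustration under stated assumptions, not the encoder described above: the projection weights, the feature dimensions, and the names `HEADS`, `embed_emotion`, and `cosine` are all hypothetical, and no CLIP weights are involved.

```python
import math
from typing import Dict, List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def project(matrix: List[Vector], features: Vector) -> Vector:
    """Apply a linear projection head (one row per output dim), then normalize."""
    out = [sum(w * x for w, x in zip(row, features)) for row in matrix]
    return l2_normalize(out)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical, hand-picked projection heads (assumption: in practice these
# would be learned so that matching emotions from different modalities align).
HEADS: Dict[str, List[Vector]] = {
    "text": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],             # 3-d -> 2-d
    "image": [[0.0, 2.0], [2.0, 0.0]],                      # 2-d -> 2-d
    "audio": [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]],  # 4-d -> 2-d
}


def embed_emotion(modality: str, features: Vector) -> Vector:
    """Embed modality-specific emotion features into the shared space."""
    return project(HEADS[modality], features)


# Toy features standing in for the same emotion seen through three modalities.
text_emb = embed_emotion("text", [3.0, 0.0, 5.0])
image_emb = embed_emotion("image", [0.0, 1.0])
audio_emb = embed_emotion("audio", [0.5, 0.5, 0.0, 0.0])
# Toy audio features standing in for a different emotion.
other_audio_emb = embed_emotion("audio", [0.0, 0.0, 1.0, 1.0])
```

Because every modality lands in the same normalized space, a downstream module only needs to consume one embedding format regardless of which modality supplied the emotion condition.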