Speech-Preserving Facial Expression Manipulation (SPFEM) aims to alter facial expressions in images and videos while retaining the original mouth movements. Despite recent advances, SPFEM still struggles with accurate lip synchronization because of the complex interplay between facial expressions and mouth shapes. Capitalizing on the ability of audio-driven talking head generation (AD-THG) models to synthesize precise lip movements, we introduce a novel integration of these models with SPFEM. We present a new framework, Talking Head Facial Expression Manipulation (THFEM), which uses AD-THG models to generate frames with accurately synchronized lip movements from audio inputs and SPFEM-altered images. However, increasing the number of frames generated by AD-THG models tends to degrade the realism and expression fidelity of the results. To counter this, we develop an adjacent frame learning strategy that fine-tunes AD-THG models to predict sequences of consecutive frames. This strategy enables the models to incorporate information from neighboring frames, significantly improving image quality at test time. Extensive experimental evaluations demonstrate that the framework effectively preserves mouth shapes during expression manipulation, highlighting the substantial benefits of integrating AD-THG with SPFEM.
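To make the adjacent frame learning strategy concrete, the following is a minimal PyTorch sketch of fine-tuning a talking-head generator to predict a short window of K consecutive frames jointly, instead of a single frame. Every name here (`TalkingHeadGenerator`, `adjacent_frame_loss`, the toy encoder/decoder, and the L1 reconstruction objective) is a hypothetical stand-in for illustration, not the paper's actual AD-THG backbone or training recipe.

```python
# A minimal sketch of adjacent frame learning, assuming a PyTorch setup.
# All module and tensor names are hypothetical illustrations.

import torch
import torch.nn as nn

class TalkingHeadGenerator(nn.Module):
    """Hypothetical AD-THG backbone: maps an audio feature and a reference
    image to K consecutive output frames instead of a single frame."""

    def __init__(self, audio_dim=256, num_adjacent=3):
        super().__init__()
        self.num_adjacent = num_adjacent
        # Toy layers; a real model would use a full generator network.
        self.audio_proj = nn.Linear(audio_dim, 64)
        self.decoder = nn.Conv2d(3 + 64, 3 * num_adjacent,
                                 kernel_size=3, padding=1)

    def forward(self, ref_image, audio_feat):
        b, _, h, w = ref_image.shape
        # Broadcast the audio embedding over the spatial grid.
        a = self.audio_proj(audio_feat).view(b, 64, 1, 1).expand(b, 64, h, w)
        out = self.decoder(torch.cat([ref_image, a], dim=1))
        # (B, K, 3, H, W): K adjacent frames predicted jointly.
        return out.view(b, self.num_adjacent, 3, h, w)

def adjacent_frame_loss(model, ref_image, audio_feat, target_frames):
    """Fine-tuning objective: reconstruct a window of consecutive
    ground-truth frames so the model learns cross-frame consistency."""
    pred = model(ref_image, audio_feat)          # (B, K, 3, H, W)
    return nn.functional.l1_loss(pred, target_frames)

if __name__ == "__main__":
    model = TalkingHeadGenerator()
    ref = torch.randn(2, 3, 64, 64)              # SPFEM-altered reference
    audio = torch.randn(2, 256)                  # audio feature for the window
    target = torch.randn(2, 3, 3, 64, 64)        # 3 consecutive target frames
    loss = adjacent_frame_loss(model, ref, audio, target)
    loss.backward()
    print(loss.item())
```

Predicting the K frames jointly lets the training signal couple neighboring frames, which is one plausible reading of how incorporating information from adjacent frames improves image quality at test time.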