We aim to edit the lip movements in a talking video to match the given speech while preserving the speaker's identity and visual details. The task can be decomposed into two sub-problems: (1) speech-driven lip motion generation and (2) visual appearance synthesis. Current solutions handle both sub-problems within a single generative model, which forces a difficult trade-off between lip-sync quality and the preservation of visual details. Instead, we propose to disentangle motion from appearance and generate them sequentially, using a speech-to-motion diffusion model followed by a motion-conditioned appearance generation model. Challenges remain in each stage: motion-aware identity preservation in (1) and visual detail preservation in (2). To preserve personal identity, we represent motion with landmarks and additionally employ a landmark-based identity loss. To capture motion-agnostic visual details, we use separate encoders for the lip region, the non-lip appearance, and the motion, and integrate their outputs with a learned fusion module. We train MyTalk on a large-scale and diverse dataset. Experiments show that our method generalizes well to unseen and even out-of-domain persons in terms of both lip sync and visual detail preservation. We encourage readers to watch the videos on our project page (https://Ingrid789.github.io/MyTalk/).
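The two-stage data flow described above can be sketched structurally as follows. This is a minimal illustration under stated assumptions, not the MyTalk implementation: the class name `TwoStageLipEditor`, the `Landmarks` type, the per-frame encoding of appearance, and all five callables are hypothetical placeholders. In particular, `speech_to_motion` only stands in for the speech-to-motion diffusion model, and `fuse` for the learned fusion module.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical motion format: one (x, y) pair per facial landmark.
Landmarks = List[Tuple[float, float]]


@dataclass
class TwoStageLipEditor:
    """Structural sketch only; every callable is an assumed placeholder."""
    speech_to_motion: Callable[[Sequence[float], Landmarks], List[Landmarks]]
    encode_lip: Callable[[Any], Any]
    encode_non_lip: Callable[[Any], Any]
    encode_motion: Callable[[Landmarks], Any]
    fuse: Callable[[Any, Any, Any], Any]

    def edit(self, speech, frames, reference_landmarks):
        # Stage 1: generate lip motion from speech; no pixels are involved here.
        motion = self.speech_to_motion(speech, reference_landmarks)
        if len(motion) != len(frames):
            raise ValueError("expected one landmark set per source frame")

        # Stage 2: synthesize appearance conditioned on the generated motion.
        # Lip appearance, non-lip appearance, and motion use separate encoders.
        edited = []
        for frame, landmarks in zip(frames, motion):
            lip_code = self.encode_lip(frame)
            non_lip_code = self.encode_non_lip(frame)
            motion_code = self.encode_motion(landmarks)
            edited.append(self.fuse(lip_code, non_lip_code, motion_code))
        return edited


def toy_speech_to_motion(speech, reference_landmarks):
    # Toy stand-in: shift every landmark vertically by that frame's speech value.
    return [[(x, y + s) for x, y in reference_landmarks] for s in speech]


# Toy stand-ins so the sketch runs end to end; real encoders would be networks.
editor = TwoStageLipEditor(
    speech_to_motion=toy_speech_to_motion,
    encode_lip=lambda frame: frame["lip"],
    encode_non_lip=lambda frame: frame["rest"],
    encode_motion=lambda landmarks: landmarks,
    fuse=lambda lip, rest, motion: {"lip": lip, "rest": rest, "motion": motion},
)

frames = [{"lip": "lip0", "rest": "face0"}, {"lip": "lip1", "rest": "face1"}]
result = editor.edit([0.5, 1.0], frames, [(0.0, 0.0), (1.0, 0.0)])
```

Keeping the stages behind separate callables mirrors the disentanglement argued for above: the motion generator never sees pixels, and the appearance stage receives motion only as a conditioning signal.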