Audio-driven portrait animation aims to synthesize portrait videos conditioned on given audio. Animating high-fidelity, multimodal video portraits has a variety of applications. Previous methods have attempted to capture different motion modes and generate high-fidelity portrait videos by training separate models or by sampling signals from given videos. However, the lack of correlation learning between lip-sync and other movements (e.g., head pose and eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method contains three stages: 1) a Mapping-Once network with Dual Attentions (MODA) generates a talking representation from given audio, using a dual-attention module that we design to encode accurate mouth movements and diverse modalities; 2) a facial composer network generates dense and detailed face landmarks; and 3) a temporal-guided renderer synthesizes stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits than previous methods.
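The data flow through the three stages can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names (`TalkingRepresentation`, `map_audio_once`, `compose_face`, `render_video`, `generate_portrait`), the choice of lip, head-pose, and blink signals as the talking representation, and the toy arithmetic that stands in for the learned networks. It shows only how one audio-to-motion mapping could feed a landmark stage and a temporally aware rendering stage, not the actual MODA architecture.

```python
from dataclasses import dataclass
from typing import List

# NOTE: all names and computations here are hypothetical stand-ins,
# not the networks described in the abstract.


@dataclass
class TalkingRepresentation:
    lip: List[float]        # per-frame mouth signal (placeholder)
    head_pose: List[float]  # per-frame head-pose signal (placeholder)
    blink: List[float]      # per-frame eye-blink signal (placeholder)


def map_audio_once(audio_features: List[float]) -> TalkingRepresentation:
    """Stage 1 stand-in: a single mapping produces all motion signals together."""
    lip = [abs(a) for a in audio_features]
    head_pose = [0.1 * a for a in audio_features]
    blink = [1.0 if i % 4 == 3 else 0.0 for i in range(len(audio_features))]
    return TalkingRepresentation(lip, head_pose, blink)


def compose_face(rep: TalkingRepresentation, num_landmarks: int = 5) -> List[List[float]]:
    """Stage 2 stand-in: expand per-frame motion signals into a landmark vector."""
    frames = []
    for lip, pose, blink in zip(rep.lip, rep.head_pose, rep.blink):
        frames.append([lip + pose - 0.01 * blink * k for k in range(num_landmarks)])
    return frames


def render_video(landmarks: List[List[float]], window: int = 2) -> List[List[float]]:
    """Stage 3 stand-in: average each frame with its predecessors for stability."""
    video = []
    for t in range(len(landmarks)):
        context = landmarks[max(0, t - window + 1): t + 1]
        video.append([sum(col) / len(context) for col in zip(*context)])
    return video


def generate_portrait(audio_features: List[float]) -> List[List[float]]:
    """Chain the three stages: audio -> representation -> landmarks -> frames."""
    rep = map_audio_once(audio_features)
    landmarks = compose_face(rep)
    return render_video(landmarks)


if __name__ == "__main__":
    frames = generate_portrait([0.2, -0.4, 0.6, -0.8])
    print(len(frames), "frames with", len(frames[0]), "values each")
```

In this sketch the lip, head-pose, and blink signals all come from the same call to `map_audio_once`, which mirrors the abstract's point that mouth movement and other motions are produced jointly rather than by separate models.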