Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach for generating identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by maintaining memory states that store information from a longer past context and guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross-attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from the audio to refine facial expressions via emotion-adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
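The memory-guided temporal module is described only at a high level above. As a rough illustration of how memory states can drive linear attention over frame tokens, the following is a minimal PyTorch sketch. It assumes the standard positive kernel feature map phi(x) = elu(x) + 1 from the linear-transformer literature and an exponentially decayed running state; the class name MemoryLinearAttention, the decay factor, and the state layout are hypothetical and not MEMO's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def phi(x):
    # Positive kernel feature map commonly used in linear attention.
    return F.elu(x) + 1.0

class MemoryLinearAttention(nn.Module):
    """Illustrative memory-guided linear attention over per-frame tokens.

    A decayed memory state accumulates key-value statistics from past
    frames, so each new frame attends to a constant-size summary of a
    long past context instead of to all past tokens.
    """
    def __init__(self, dim, decay=0.95):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.decay = decay  # how quickly old frames fade from memory (hypothetical)

    def forward(self, x, state=None):
        # x: (batch, tokens, dim) -- tokens of the current frame
        q, k = phi(self.to_q(x)), phi(self.to_k(x))
        v = self.to_v(x)
        b, _, d = q.shape
        if state is None:
            kv = x.new_zeros(b, d, v.size(-1))  # running sum of phi(k) v^T
            z = x.new_zeros(b, d)               # running sum of phi(k)
        else:
            kv, z = state
        # Fold the current frame's keys/values into the memory state.
        kv = self.decay * kv + torch.einsum('btd,bte->bde', k, v)
        z = self.decay * z + k.sum(dim=1)
        # Linear attention against the accumulated state.
        num = torch.einsum('btd,bde->bte', q, kv)
        den = torch.einsum('btd,bd->bt', q, z).clamp_min(1e-6).unsqueeze(-1)
        return num / den, (kv, z)
```

Carrying the state (kv, z) forward across frames keeps the per-frame cost constant while the memory summarizes the entire generated history, which is the property the abstract credits for long-term identity consistency and motion smoothness.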
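The emotion-adaptive layer norm in the emotion-aware audio module is likewise only named above. A common way to realize such conditioning is FiLM-style modulation of normalized features by an emotion embedding, e.g. one predicted from the audio by an emotion classifier. The sketch below assumes that formulation; the class EmotionAdaLN and its interface are hypothetical.

```python
import torch
import torch.nn as nn

class EmotionAdaLN(nn.Module):
    """Illustrative emotion-adaptive layer norm: an emotion embedding,
    e.g. predicted from the audio, produces a scale and shift that
    modulate the normalized hidden states (FiLM-style)."""
    def __init__(self, dim, emotion_dim):
        super().__init__()
        # Affine parameters come from the emotion embedding, not LayerNorm itself.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(emotion_dim, 2 * dim)

    def forward(self, x, emotion_emb):
        # x: (batch, tokens, dim); emotion_emb: (batch, emotion_dim)
        scale, shift = self.to_scale_shift(emotion_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```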