Audio-driven facial animation has made significant progress in multimedia applications, with diffusion models showing strong potential for talking-face synthesis. However, most existing methods treat speech features as a monolithic representation and fail to capture the fine-grained roles these features play in driving different facial motions. They also overlook the importance of modeling keyframes with intense dynamics. To address these limitations, we propose KSDiff, a Keyframe-Augmented Speech-Aware Dual-Path Diffusion framework. Specifically, the raw audio and transcript are processed by a Dual-Path Speech Encoder (DPSE) to disentangle expression-related and head-pose-related features, while an autoregressive Keyframe Establishment Learning (KEL) module predicts the most salient motion frames. These components are integrated into a Dual-Path Motion Generator to synthesize coherent and realistic facial motions. Extensive experiments on HDTF and VoxCeleb demonstrate that KSDiff achieves state-of-the-art performance, with improvements in both lip synchronization accuracy and head-pose naturalness. Our results highlight the effectiveness of combining speech disentanglement with keyframe-aware diffusion for talking-head generation. The demo page is available at: https://kincin.github.io/KSDiff/.