Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference, which limits their practical applications. Existing acceleration methods for general diffusion models fail to exploit the temporal and spatial redundancies unique to talking head generation. In this paper, we propose a task-specific framework that addresses these inefficiencies through two key innovations. First, we introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP), which caches static features to bypass most model layers at inference time. It also enables parallel prediction by taking cached features and estimated noisy latents as inputs, efficiently circumventing sequential sampling. Second, we propose Decoupled Foreground Attention (DFA) to further accelerate attention computation, exploiting the spatial decoupling in talking head videos to restrict attention to the dynamic foreground region. In addition, we remove reference features from certain layers for extra speedup. Extensive experiments demonstrate that our framework significantly improves inference speed while preserving video quality.
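To make the caching idea concrete, below is a minimal toy sketch of caching-based layer skipping, under the assumption that a deep subnetwork depends only on static conditioning (reference image, audio features) while a shallow subnetwork consumes the per-step noisy latent. The module names and layer split are hypothetical illustrations, not the paper's actual LightningCP architecture:

```python
import torch
from torch import nn

class CachedDenoiser(nn.Module):
    """Toy illustration of caching-based layer skipping.

    Features of the static conditioning signal are computed once at the
    first denoising step and cached; subsequent steps rerun only the
    shallow noise-dependent layers, bypassing the deep static branch.
    """
    def __init__(self, dim=64):
        super().__init__()
        # Deep branch over static conditioning: run once, then cached.
        self.static_net = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Shallow branch over the noisy latent: rerun at every step.
        self.dynamic_net = nn.Linear(2 * dim, dim)
        self._cache = None

    def forward(self, x_t, cond, step):
        if step == 0 or self._cache is None:
            self._cache = self.static_net(cond)  # expensive pass, once
        # Cheap per-step pass fuses the latent with cached features.
        return self.dynamic_net(torch.cat([x_t, self._cache], dim=-1))
```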
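Similarly, the sketch below illustrates the foreground-restricted attention idea behind DFA: background tokens are dropped from the key/value set via a binary foreground mask, so attention cost scales with the foreground area rather than the full frame. The function and tensor names are assumptions for illustration; the paper's exact DFA formulation may differ:

```python
import torch

def foreground_attention(q, k, v, fg_mask):
    """Restrict attention to foreground tokens (illustrative sketch).

    q, k, v:  (B, N, C) query/key/value tokens for one frame's latent.
    fg_mask:  (B, N) boolean mask, True where a token lies in the
              dynamic foreground (e.g., face and torso) region.
    """
    B, N, C = q.shape
    outputs = []
    for b in range(B):
        idx = fg_mask[b].nonzero(as_tuple=True)[0]  # foreground token indices
        k_fg, v_fg = k[b, idx], v[b, idx]           # (N_fg, C) reduced key/value set
        attn = torch.softmax(q[b] @ k_fg.T / C**0.5, dim=-1)  # (N, N_fg)
        outputs.append(attn @ v_fg)                 # (N, C)
    return torch.stack(outputs)

# Usage: with ~30% foreground tokens, the key/value set shrinks accordingly.
q = k = v = torch.randn(2, 1024, 64)
mask = torch.rand(2, 1024) > 0.7
out = foreground_attention(q, k, v, mask)  # (2, 1024, 64)
```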