Diffusion models have significantly advanced talking head generation (THG). However, slow inference and the prevailing non-autoregressive paradigm severely limit the practical deployment of diffusion-based THG models. In this work, we propose REST, a pioneering diffusion-based framework for real-time, end-to-end, streaming audio-driven talking head generation. To support real-time end-to-end generation, we first learn a compact video latent space with a spatiotemporal variational autoencoder at a high compression ratio. To enable semi-autoregressive streaming within this compact latent space, we introduce an ID-Context Cache mechanism that integrates ID-Sink and Context-Cache principles into key-value caching, maintaining identity consistency and temporal coherence during long-term streaming generation. Furthermore, we propose an Asynchronous Streaming Distillation (ASD) strategy that leverages a non-streaming teacher with an asynchronous noise schedule to supervise the streaming student, mitigating error accumulation and enhancing temporal consistency in streaming generation. REST bridges the gap between autoregressive and diffusion-based approaches, achieving a breakthrough in efficiency for applications requiring real-time THG. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.
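The abstract describes the ID-Context Cache only at a high level. The sketch below illustrates one plausible reading of how ID-Sink and Context-Cache principles could combine in a key-value cache: identity tokens are pinned for the whole stream (never evicted, in the spirit of attention sinks), while a rolling window keeps only the most recent latent-frame tokens. All names (`IDContextCache`, `max_context`, tensor layout `(batch, seq, dim)`) are hypothetical and not taken from the paper.

```python
import torch


class IDContextCache:
    """Hypothetical sketch of a KV cache combining an 'ID-Sink' (pinned
    identity tokens) with a rolling 'Context-Cache' of recent latent frames.
    Illustrative only; not the paper's implementation."""

    def __init__(self, max_context: int):
        self.max_context = max_context    # cap on cached context tokens
        self.sink_k = self.sink_v = None  # pinned identity keys/values
        self.ctx_k = self.ctx_v = None    # rolling context keys/values

    def set_id_sink(self, k: torch.Tensor, v: torch.Tensor):
        # Identity tokens (e.g., from a reference frame) remain cached for
        # the entire stream, preserving identity consistency.
        self.sink_k, self.sink_v = k, v

    def append(self, k: torch.Tensor, v: torch.Tensor):
        # Append keys/values of the newest latent chunk, then evict the
        # oldest context tokens once the window exceeds max_context.
        self.ctx_k = k if self.ctx_k is None else torch.cat([self.ctx_k, k], dim=1)
        self.ctx_v = v if self.ctx_v is None else torch.cat([self.ctx_v, v], dim=1)
        if self.ctx_k.shape[1] > self.max_context:
            self.ctx_k = self.ctx_k[:, -self.max_context:]
            self.ctx_v = self.ctx_v[:, -self.max_context:]

    def kv(self):
        # Attention for a new chunk attends to [ID sink] + [recent context].
        if self.ctx_k is None:
            return self.sink_k, self.sink_v
        return (torch.cat([self.sink_k, self.ctx_k], dim=1),
                torch.cat([self.sink_v, self.ctx_v], dim=1))
```

Pinning the identity tokens bounds memory while keeping a stable identity anchor, so long-term streaming does not drift as old frames are evicted from the rolling window.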
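Similarly, the ASD strategy is only named in the abstract. Below is a minimal training-step sketch under one interpretation: chunks of the latent sequence are noised at different levels (the asynchronous noise schedule), the non-streaming teacher denoises the whole sequence at once, and the streaming student, denoising chunk by chunk with a cache, is trained to match it. The function name, the `student`/`teacher` call signatures, and the loss choice are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def asd_step(student, teacher, latents, chunk_len, sigmas):
    """Hypothetical Asynchronous Streaming Distillation step (sketch only).

    latents: clean video latents of shape (B, T, C).
    sigmas:  per-chunk noise levels, one per chunk of length chunk_len --
             the assumed 'asynchronous' noise schedule.
    """
    # Noise each chunk at its own level, then reassemble the sequence.
    chunks = latents.split(chunk_len, dim=1)
    noisy = torch.cat(
        [c + s * torch.randn_like(c) for c, s in zip(chunks, sigmas)], dim=1)

    with torch.no_grad():
        # Non-streaming teacher denoises the full asynchronously noised
        # sequence in a single pass (assumed signature).
        target = teacher(noisy, sigmas)

    # Streaming student denoises the same sequence chunk by chunk, carrying
    # a cache forward, and is supervised by the teacher's full-sequence
    # prediction -- curbing error accumulation across chunks.
    preds, cache = [], None
    for i, chunk in enumerate(noisy.split(chunk_len, dim=1)):
        pred, cache = student(chunk, sigmas[i], cache=cache)
        preds.append(pred)
    return F.mse_loss(torch.cat(preds, dim=1), target)
```

The key idea this sketch tries to capture is the mismatch being distilled away: the teacher sees global context the student never will at inference time, so matching the teacher's predictions pushes the streaming student toward temporally consistent outputs.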