Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce Accelerated Rolling Diffusion, a novel framework for streaming gesture generation that extends rolling diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. We further propose Rolling Diffusion Ladder Acceleration (RDLA), which restructures the noise schedule into a stepwise ladder so that multiple frames are denoised simultaneously. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 2× speedup with high visual fidelity and temporal coherence. We evaluate our approach on ZEGGS and BEAT, two established co-speech gesture benchmarks chosen for their real-world applicability. Our framework is universally applicable to any diffusion-based gesture generation model, transforming it into a streaming approach. Applied to three state-of-the-art methods, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time, high-fidelity co-speech gesture synthesis.
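To make the ladder idea concrete, the following is a minimal sketch of how a stepwise ladder noise schedule and one rolling denoising step might look. This is an illustrative assumption, not the paper's implementation: the function names, the uniform spacing of noise levels, and the `denoiser` callable are all hypothetical.

```python
import numpy as np

def ladder_noise_schedule(window_len: int, rung_size: int) -> np.ndarray:
    """Hypothetical stepwise-ladder schedule over a rolling-diffusion window.

    Frames are grouped into 'rungs' of rung_size consecutive frames that
    share one noise level; levels rise from near 0 (oldest frame, almost
    clean) to near 1 (newest frame, almost pure noise).
    """
    n_rungs = int(np.ceil(window_len / rung_size))
    # One noise level per rung, spaced uniformly in (0, 1).
    levels = (np.arange(n_rungs) + 0.5) / n_rungs
    # Repeat each rung's level for its frames, truncated to window_len.
    return np.repeat(levels, rung_size)[:window_len]

def rolling_denoise_step(window, schedule, denoiser, rung_size):
    """One rolling step: the model moves every frame down one rung of the
    ladder, the rung_size now-clean frames are emitted, and rung_size
    fresh fully-noised frames are appended for the incoming context.
    """
    window = denoiser(window, schedule)
    clean, rest = window[:rung_size], window[rung_size:]
    fresh = np.random.randn(rung_size, *window.shape[1:])
    return clean, np.concatenate([rest, fresh], axis=0)
```

Because each step emits `rung_size` frames instead of one while the per-step network cost stays fixed, grouping frames into rungs is what yields the multi-frame speedup the abstract describes.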