RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational cost due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce the computational cost within each individual denoising step through techniques such as sparse attention and KV caching. However, they still adhere to an inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must undergo a complete, dense denoising process across all diffusion timesteps. We observe that, because content and motion are correlated across adjacent frames, once keyframes with critical semantic transitions are anchored, the intermediate states of the remaining frames tend to follow more predictable trajectories. This indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce \textbf{RhymeFlow}, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Because the skipped intermediate states of non-keyframes break temporal coherence during the keyframe denoising steps and lead to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines, achieving higher inference speed and better visual quality.
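
To make the asynchronous schedule concrete, the following is a minimal sketch rather than the RhymeFlow implementation. It assumes the keyframe indices are already given and that non-keyframes skip steps with a stride that grows linearly over the run; the abstract specifies neither the keyframe selection rule nor the skipping rule. The names `build_async_schedule`, `count_updates`, and `max_stride` are hypothetical, and the latent trajectory projection module is not modeled.

```python
from typing import List, Sequence


def build_async_schedule(
    num_frames: int,
    num_steps: int,
    keyframes: Sequence[int],
    max_stride: int = 4,
) -> List[List[bool]]:
    """Return mask[t][f], which is True when frame f is denoised at step t.

    Keyframes are denoised at every step. Non-keyframes are denoised only
    when enough steps have passed since their last update, where "enough"
    is a stride that grows from 1 to max_stride (an assumed rule).
    """
    key = set(keyframes)
    mask: List[List[bool]] = []
    last_update = -max_stride  # guarantees non-keyframes are updated at step 0
    for t in range(num_steps):
        # Assumed rule: the stride grows linearly with denoising progress.
        stride = 1 + (t * max_stride) // num_steps
        update_others = (t - last_update) >= stride
        if update_others:
            last_update = t
        mask.append([f in key or update_others for f in range(num_frames)])
    return mask


def count_updates(mask: List[List[bool]]) -> int:
    """Count per-frame denoising updates implied by the schedule."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    schedule = build_async_schedule(num_frames=6, num_steps=8, keyframes=[0, 5])
    dense = 6 * 8
    print(f"frame updates: {count_updates(schedule)} of {dense} in the dense baseline")
```

The update count is only a rough proxy for cost: 3D attention scales quadratically with the number of tokens processed per step, so the actual savings depend on how skipped frames are handled inside attention.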
