While diffusion models have achieved great success in video generation, this progress comes with a rapidly escalating computational burden. Among existing acceleration methods, feature caching is popular for its training-free nature and considerable speedup, but it inevitably suffers semantic and detail degradation under aggressive compression. Another widely adopted method, training-aware step distillation, though successful in image generation, also degrades drastically in few-step video generation. Moreover, naively applying training-free feature caching to step-distilled models exacerbates the quality loss, since the sampling steps become even sparser. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We replace traditional training-free heuristics with a lightweight learnable neural predictor, enabling more accurate modeling of the high-dimensional feature evolution during diffusion sampling. Furthermore, we investigate the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach that achieves more stable and nearly lossless distillation. Together, these contributions push the acceleration boundary to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is provided in the supplementary materials and will be made publicly available.