Diffusion transformers have shown significant effectiveness in both image and video synthesis, but at the expense of substantial computation cost. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching features from previous timesteps and reusing them in subsequent timesteps. However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching: caching the features of some tokens can damage overall generation quality up to 10$\times$ more than caching others. In this paper, we introduce token-wise feature caching, which adaptively selects the tokens most suitable for caching and further allows different caching ratios to be applied to neural layers of different types and depths. Extensive experiments on PixArt-$\alpha$, OpenSora, and DiT demonstrate the effectiveness of our method in both image and video generation without requiring any training. For instance, 2.36$\times$ and 1.93$\times$ acceleration is achieved on OpenSora and PixArt-$\alpha$, respectively, with almost no drop in generation quality.
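The core idea above — reuse cached features for most tokens and recompute only the tokens most sensitive to caching — can be illustrated with a minimal sketch. This is not the paper's actual algorithm: the sensitivity score (distance between the current token input and its cached feature), the function names, and the `cache_ratio` parameter are all illustrative assumptions.

```python
import numpy as np

def tokenwise_cache_step(tokens, cache, compute_layer, cache_ratio=0.6):
    """Hedged sketch of token-wise feature caching.

    tokens:        (n, d) current token inputs to a layer
    cache:         (n, d) features cached from a previous timestep
    compute_layer: the (expensive) layer computation, applied only
                   to the tokens selected for recomputation
    cache_ratio:   fraction of tokens whose cached features are reused
    """
    n = tokens.shape[0]
    # Assumed sensitivity proxy: tokens whose inputs drifted furthest
    # from their cached features are most damaged by caching.
    scores = np.linalg.norm(tokens - cache, axis=-1)
    k = int(n * (1 - cache_ratio))          # number of tokens to recompute
    recompute_idx = np.argsort(scores)[-k:]  # the k most sensitive tokens
    out = cache.copy()                       # reuse cached features by default
    out[recompute_idx] = compute_layer(tokens[recompute_idx])
    return out
```

With `cache_ratio=0.6`, only 40% of the tokens pass through the expensive layer each step, which is where the speedup would come from; the per-layer ratio could then be varied by layer type and depth as the abstract describes.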