Diffusion Transformers (DiTs) have gained prominence for their outstanding scalability and extraordinary performance in generative tasks. However, their considerable inference costs impede practical deployment. The feature cache mechanism, which stores and retrieves redundant computations across timesteps, holds promise for reducing per-step inference time in diffusion models. Most existing caching methods for DiTs are manually designed. Although the learning-based approach attempts to optimize strategies adaptively, it suffers from discrepancies between training and inference, which hamper both performance and the acceleration ratio. Upon detailed analysis, we pinpoint that these discrepancies primarily stem from two aspects: (1) Prior Timestep Disregard, where training ignores the effect of cache usage at earlier timesteps, and (2) Objective Mismatch, where the training target (aligning the predicted noise at each timestep) deviates from the goal of inference (generating high-quality images). To alleviate these discrepancies, we propose HarmoniCa, a method that Harmonizes training and inference via a novel learning-based Caching framework built upon Step-Wise Denoising Training (SDT) and an Image Error Proxy-Guided Objective (IEPO). Compared to the traditional training paradigm, the newly proposed SDT maintains the continuity of the denoising process, enabling the model to leverage information from prior timesteps during training, mirroring the way it operates during inference. Furthermore, we design IEPO, which integrates an efficient proxy mechanism to approximate the final image error caused by reusing cached features. IEPO thus helps balance final image quality against cache utilization, resolving the issue that training considers only the impact of cache usage on the predicted output at each individual timestep.
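The feature cache mechanism can be illustrated with a minimal sketch: a boolean schedule decides, per block and per timestep, whether a block recomputes its feature or reuses the one stored at the last recomputed step. The `CachedBlock`/`denoise` names and the toy per-block computation below are illustrative assumptions, not the paper's actual DiT architecture or learned schedule.

```python
import math

class CachedBlock:
    """Toy transformer block with a per-block feature cache (illustrative sketch)."""
    def __init__(self, weight):
        self.weight = weight
        self.cache = None        # feature stored at the last recomputed timestep
        self.compute_calls = 0   # how many times the block was actually executed

    def forward(self, x, reuse):
        if reuse and self.cache is not None:
            out = self.cache                        # retrieve: skip the computation
        else:
            out = [math.tanh(self.weight * v) for v in x]
            self.cache = out                        # store for later timesteps
            self.compute_calls += 1
        return [a + b for a, b in zip(x, out)]      # residual connection

def denoise(blocks, x, schedule):
    """schedule[t][i] is True when block i reuses its cached feature at timestep t."""
    for t, plan in enumerate(schedule):
        for blk, reuse in zip(blocks, plan):
            x = blk.forward(x, reuse and t > 0)     # the first step always computes
    return x
```

With two blocks over four timesteps, each `True` entry in the schedule skips one block execution, which is the source of the per-step speedup; a learning-based method would optimize this schedule rather than fix it by hand.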