Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck. Among common training-free acceleration techniques, caching offers high speedups but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruning can balance acceleration and generation quality. However, existing methods typically configure caching and pruning with fixed, heuristic schemes. While these schemes roughly follow the overall trend in a generation model's sensitivity to acceleration, they fail to capture fine-grained, complex variations, so they inevitably skip highly sensitive computations and degrade quality. Furthermore, such manually designed strategies generalize poorly. To address these issues, we propose SODA, a Sensitivity-Oriented Dynamic Acceleration method that adaptively performs caching and pruning based on fine-grained sensitivity. SODA builds an offline sensitivity error modeling framework across timesteps, layers, and modules to capture the sensitivity to different acceleration operations. The cache intervals are optimized via dynamic programming with sensitivity error as the cost function, minimizing the sensitivity error that caching introduces. During pruning and cache reuse, SODA adaptively determines the pruning timing and rate to preserve computations of highly sensitive tokens, significantly enhancing generation fidelity. Extensive experiments on DiT-XL/2, PixArt-$\alpha$, and OpenSora demonstrate that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios. Our code is publicly available at https://github.com/leaves162/SODA.
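The abstract states that cache intervals are chosen by dynamic programming with sensitivity error as the cost. The sketch below illustrates that general idea only; it is not taken from the SODA code base. It assumes a hypothetical function `seg_cost(i, j)` that returns the offline-measured error of computing fully at step `i` and reusing the cached result for steps `i+1` through `j-1`, and it assumes the budget is given as a fixed number of full computations (`num_full`). The names `plan_cache_steps` and `toy_cost`, and the toy cost formula, are illustrative assumptions.

```python
from typing import Callable, List, Tuple


def plan_cache_steps(
    num_steps: int,
    num_full: int,
    seg_cost: Callable[[int, int], float],
) -> Tuple[float, List[int]]:
    """Pick `num_full` steps to compute fully so that total cached error is minimal.

    seg_cost(i, j) is an assumed, precomputed error for computing at step i
    and reusing that result for steps i+1 .. j-1 (the segment [i, j)).
    """
    if not 1 <= num_full <= num_steps:
        raise ValueError("num_full must be between 1 and num_steps")

    inf = float("inf")
    # best[k][j]: minimal error for covering steps [0, j) with k full computations.
    best = [[inf] * (num_steps + 1) for _ in range(num_full + 1)]
    # prev[k][j]: start of the last segment in the optimal cover of [0, j).
    prev = [[-1] * (num_steps + 1) for _ in range(num_full + 1)]
    best[0][0] = 0.0

    for k in range(1, num_full + 1):
        for j in range(k, num_steps + 1):
            for i in range(k - 1, j):
                if best[k - 1][i] == inf:
                    continue
                candidate = best[k - 1][i] + seg_cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    prev[k][j] = i

    # Walk the back-pointers to recover the steps that are computed fully.
    steps: List[int] = []
    j = num_steps
    for k in range(num_full, 0, -1):
        i = prev[k][j]
        steps.append(i)
        j = i
    steps.reverse()
    return best[num_full][num_steps], steps


def toy_cost(sensitivity: List[float]) -> Callable[[int, int], float]:
    """Hypothetical cost: reuse error grows with sensitivity and cache age."""
    def seg_cost(i: int, j: int) -> float:
        return sum(sensitivity[t] * (t - i) for t in range(i + 1, j))
    return seg_cost


if __name__ == "__main__":
    # Step 2 is highly sensitive, so the plan refreshes the cache there.
    error, steps = plan_cache_steps(4, 2, toy_cost([1, 1, 10, 1]))
    print(error, steps)
```

With the toy sensitivities `[1, 1, 10, 1]` and two full computations, the plan computes at steps 0 and 2, which avoids reusing a stale cache at the most sensitive step. In a real setting the cost table would come from offline error measurements per timestep, layer, and module, as the abstract describes; the triple loop here costs O(K·T²) evaluations of `seg_cost` for T steps and K full computations.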