Diffusion Transformers with Mixture-of-Experts (DiT-MoE) improve model capacity under sparse activation, but diffusion inference is still bottlenecked by redundant computation across timesteps. Existing caching methods mainly operate at the token level, which becomes suboptimal in DiT-MoE because each token update is internally decomposed into multiple routed expert branches. Our analysis shows that cross-timestep redundancy in DiT-MoE is better characterized at the expert-branch level than at the whole-token level. Based on this observation, we propose MoECa, a fine-grained caching framework that performs branch-level feature reuse across timesteps. MoECa further introduces expert-aware adaptive control and synchronized cache updates across MoE and attention paths to maintain stable intermediate states. Experiments on multiple DiT-MoE models show that MoECa consistently achieves a better speed-quality trade-off than prior caching methods, with up to 2.83$\times$ inference speedup and minimal quality degradation.
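To make the idea of branch-level feature reuse concrete, the following toy sketch caches each routed expert output separately, keyed by (token, expert), and recomputes a branch only when its input has drifted since the cached timestep. It is an illustration under stated assumptions, not MoECa's actual algorithm: the names `BranchCache`, `rel_change`, and `moe_forward`, the input-drift reuse rule, and the threshold value are all hypothetical, and the sketch models neither the expert-aware adaptive control nor the synchronized updates with the attention path.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def rel_change(a: Vector, b: Vector) -> float:
    """Relative L1 change of a with respect to b (equal-length vectors)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) + 1e-8
    return num / den


@dataclass
class BranchCache:
    """Hypothetical cache of expert outputs, one entry per (token, expert) branch."""
    threshold: float  # assumed reuse tolerance, not a value from the paper
    store: Dict[Tuple[int, int], Tuple[Vector, Vector]] = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def run(self, token: int, expert: int, x: Vector,
            fn: Callable[[Vector], Vector]) -> Vector:
        cached = self.store.get((token, expert))
        if cached is not None and rel_change(x, cached[0]) < self.threshold:
            self.hits += 1
            return cached[1]  # reuse this branch's output from an earlier timestep
        self.misses += 1
        y = fn(x)  # recompute only this branch
        self.store[(token, expert)] = (x, y)
        return y


def moe_forward(tokens, routes, experts, cache):
    """Gate-weighted sum of the routed expert branches for each token."""
    outputs = []
    for t, x in enumerate(tokens):
        out = [0.0] * len(x)
        for expert_id, gate in routes[t]:
            y = cache.run(t, expert_id, x, experts[expert_id])
            out = [o + gate * v for o, v in zip(out, y)]
        outputs.append(out)
    return outputs


# Toy experts standing in for expert MLPs.
experts = {
    0: lambda x: [2.0 * v for v in x],
    1: lambda x: [v + 1.0 for v in x],
    2: lambda x: [-v for v in x],
}
cache = BranchCache(threshold=0.05)

# Timestep 1: token 0 is routed to experts 0 and 1 (both branches computed).
step1 = moe_forward([[1.0, 2.0]], [[(0, 0.5), (1, 0.5)]], experts, cache)

# Timestep 2: nearly the same input, but the router now picks experts 0 and 2.
# Branch (0, 0) is reused; branch (0, 2) has no cache entry and is computed.
step2 = moe_forward([[1.01, 2.0]], [[(0, 0.5), (2, 0.5)]], experts, cache)

print(step1, step2, cache.hits, cache.misses)
```

In the second timestep, a whole-token cache would have to either reuse an output that mixes in the wrong expert or recompute everything, whereas the per-branch cache reuses the one branch that is still valid and computes only the newly routed one.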