Diffusion Transformers (DiTs) incur prohibitive computational costs due to the quadratic scaling of self-attention with token count. Existing pruning methods fail to simultaneously satisfy differentiability, efficiency, and the strict static budgets required to avoid hardware overhead. To address this, we propose Shiva-DiT, which reconciles these conflicting requirements via Residual-Based Differentiable Top-$k$ Selection. By leveraging a residual-aware straight-through estimator, our method enforces deterministic token counts for static compilation while preserving end-to-end learnability through residual gradient estimation. Furthermore, we introduce a Context-Aware Router and an Adaptive Ratio Policy to learn the pruning schedule autonomously. Experiments on mainstream models, including SD3.5, demonstrate that Shiva-DiT establishes a new Pareto frontier, achieving a 1.54$\times$ wall-clock speedup with superior fidelity over existing baselines while effectively eliminating ragged-tensor overheads.
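The abstract does not spell out the residual-aware formulation, but the core mechanism it names, a hard top-$k$ mask in the forward pass with straight-through gradients to the router, can be sketched compactly. The following PyTorch snippet is a minimal illustration of that generic idea, not the authors' implementation; the function name `residual_ste_topk_mask` and all shapes are assumptions introduced here.

```python
import torch

def residual_ste_topk_mask(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Hard top-k token mask (forward) with soft gradients (backward).

    scores: router logits of shape (batch, num_tokens).
    Returns a 0/1 mask of the same shape with exactly k ones per row.
    """
    soft = torch.sigmoid(scores)                        # differentiable relaxation
    idx = soft.topk(k, dim=-1).indices                  # deterministic token count
    hard = torch.zeros_like(soft).scatter(-1, idx, 1.0)
    # Straight-through estimator: the forward value is the hard mask
    # (exactly k kept tokens, hence static shapes for compilation), while
    # the backward pass sees the gradient of `soft`, keeping the router
    # end-to-end learnable.
    return hard + soft - soft.detach()

# Usage sketch with hypothetical sizes: masked tokens are zeroed, and
# gradients still reach the router scores. A real pipeline would instead
# gather the k kept tokens into a fixed (batch, k, dim) tensor so that
# attention runs on dense, statically shaped inputs.
x = torch.randn(2, 256, 64, requires_grad=True)   # (batch, tokens, dim)
scores = x.mean(-1)                               # stand-in for a learned router
mask = residual_ste_topk_mask(scores, k=128)
x_pruned = x * mask.unsqueeze(-1)
```

Because the forward mask always keeps exactly $k$ tokens, downstream kernels see fixed tensor shapes, which is what permits static compilation and avoids ragged-tensor overheads.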