Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigation reveals that these costs stem primarily from the static inference paradigm, which inevitably introduces redundant computation at certain diffusion timesteps and in certain spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both the timestep and spatial dimensions. Building on this design, we present an extended version, DyDiT++, which improves on DyDiT in three key aspects. First, it extends the generation mechanism beyond diffusion to flow matching, demonstrating that our method can also accelerate flow-matching-based generation and thereby enhancing its versatility. Second, we extend DyDiT to more complex visual generation tasks, including video generation and text-to-image generation, broadening its real-world applicability. Finally, to reduce the high cost of full fine-tuning and democratize access to the technology, we investigate training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with fewer than 3% additional fine-tuning iterations, our approach reduces the FLOPs of DiT-XL by 51%, yields a realistic 1.73x speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet. The code is available at https://github.com/alibaba-damo-academy/DyDiT.