Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency stems largely from the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase regardless of content complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes according to content complexity and the denoising timestep. Our key insight is that early timesteps require only coarse patches to model global structure, while later iterations demand finer (smaller) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for both image and video generation, substantially reducing cost while preserving perceptual quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\times$ and $3.2\times$ speedup on FLUX-1.Dev and Wan 2.1, respectively, without compromising generation quality or prompt adherence.
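The coarse-to-fine idea above can be sketched with a toy schedule. This is a minimal illustration, not the paper's actual algorithm: the function names, the piecewise schedule, and the candidate patch sizes are all hypothetical choices, assuming a square latent and a monotone coarse-to-fine progression across denoising steps. It shows why the token count, and hence the quadratic attention cost, is small at early (noisy) steps and grows only toward the end.

```python
def patch_size_schedule(t, num_steps, sizes=(4, 2, 1)):
    """Hypothetical coarse-to-fine schedule: map denoising step t
    (0 = noisiest, num_steps-1 = final) to a patch edge length.
    Early steps get the largest patch; later steps refine."""
    frac = t / max(num_steps - 1, 1)
    idx = min(int(frac * len(sizes)), len(sizes) - 1)
    return sizes[idx]

def tokens_per_step(latent_side, t, num_steps):
    """Number of tokens for a square latent of side latent_side
    at step t, given the schedule above."""
    p = patch_size_schedule(t, num_steps)
    return (latent_side // p) ** 2

# A 64x64 latent over 50 steps: early steps use 4x4 patches
# (256 tokens), final steps use 1x1 patches (4096 tokens).
early = tokens_per_step(64, 0, 50)    # 256
late = tokens_per_step(64, 49, 50)    # 4096
```

Since self-attention cost scales roughly quadratically in the token count, running most steps at the coarse setting is where a speedup of the reported magnitude could plausibly come from; the actual allocation in the paper also depends on content complexity, which this sketch omits.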