Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of using a single patch size for all timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: large patch sizes are used for high-noise timesteps and small patch sizes for low-noise timesteps; linear projections are learned for each patch size; and Unpatchify is modified accordingly. Unlike Pyramidal Flow, our approach operates over full latent representations rather than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach through two training manners. Training from scratch achieves a $1.6\times$ ($2.0\times$) inference speedup over SiT-B/2 for 2-level (3-level) pyramid patchification, with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with a small amount of training time. The code and checkpoints are available at https://github.com/fudan-generative-vision/PPFlow.
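The core idea, patchifying the same full latent with a timestep-dependent patch size and a per-size linear projection, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the 2-level noise threshold of 0.5, and the tiny dimensions are all illustrative assumptions.

```python
import numpy as np

def patchify(latent, patch_size, proj):
    # latent: (C, H, W). Split into non-overlapping p x p patches,
    # flatten each to a C*p*p vector, then linearly project to tokens.
    C, H, W = latent.shape
    p = patch_size
    n_h, n_w = H // p, W // p
    patches = latent.reshape(C, n_h, p, n_w, p).transpose(1, 3, 0, 2, 4)
    patches = patches.reshape(n_h * n_w, C * p * p)  # (num_tokens, C*p*p)
    return patches @ proj                            # (num_tokens, d_model)

def patch_size_for_timestep(t, high_noise_threshold=0.5):
    # Hypothetical 2-level schedule: large patches while noise is high,
    # small patches once the sample is mostly denoised.
    return 4 if t >= high_noise_threshold else 2

# A separate projection is learned for each patch size (here: random stand-ins).
rng = np.random.default_rng(0)
C, d_model = 4, 8
proj = {p: rng.standard_normal((C * p * p, d_model)) for p in (2, 4)}

latent = rng.standard_normal((C, 16, 16))  # full latent, never downsampled
tokens_hi = patchify(latent, patch_size_for_timestep(0.9), proj[4])  # few tokens
tokens_lo = patchify(latent, patch_size_for_timestep(0.1), proj[2])  # many tokens
```

Note that both calls consume the same full-resolution latent; only the patch size (and hence the token count, which drives the quadratic attention cost) changes across timesteps, which is what distinguishes this from a pyramid over downsampled latents.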