Nature is infinitely resolution-free. In this context, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To address this limitation, we conceptualize images as sequences of tokens with dynamic sizes, in contrast to traditional methods that perceive images as fixed-resolution grids. This perspective enables a flexible training strategy that seamlessly accommodates various aspect ratios during both training and inference, thus promoting resolution generalization and eliminating biases introduced by image cropping. On this basis, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. We further upgrade FiT to FiTv2 with several innovative designs, including Query-Key vector normalization, an AdaLN-LoRA module, a rectified flow scheduler, and a Logit-Normal sampler. Enhanced by a meticulously adjusted network structure, FiTv2 achieves twice the convergence speed of FiT. When incorporating advanced training-free extrapolation techniques, FiTv2 demonstrates remarkable adaptability in both resolution extrapolation and diverse-resolution generation. Additionally, our exploration of the scalability of FiTv2 reveals that larger models exhibit better computational efficiency. Furthermore, we introduce an efficient post-training strategy to adapt a pre-trained model for high-resolution generation. Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions. We have released all code and models at https://github.com/whlzy/FiT to promote the exploration of diffusion transformer models for arbitrary-resolution image generation.
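Two of the components named above can be sketched in isolation to make the abstract concrete: a Logit-Normal timestep sampler (sample from a normal distribution, then squash through a sigmoid so timesteps concentrate at intermediate noise levels) and Query-Key vector normalization (L2-normalize queries and keys before the attention dot product, which bounds logit magnitudes at token counts unseen during training). This is a minimal illustrative sketch, not the released FiTv2 implementation; the function names, the unit scale factor, and all hyperparameters are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_normal_timesteps(n, mean=0.0, std=1.0):
    """Sample n diffusion timesteps t in (0, 1) from a logit-normal
    distribution: t = sigmoid(z) with z ~ N(mean, std^2), biasing
    training toward intermediate noise levels."""
    z = rng.normal(mean, std, size=n)
    return 1.0 / (1.0 + np.exp(-z))

def qk_norm_attention(q, k, v, eps=1e-6):
    """Single-head attention with Query-Key vector normalization:
    q and k are L2-normalized along the channel axis before the dot
    product, so attention logits stay bounded regardless of the
    (possibly extrapolated) sequence length."""
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = q @ k.swapaxes(-1, -2)  # in practice a learnable scale is common here
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v
```

With normalized queries and keys each logit is a cosine similarity in [-1, 1], which is one intuition for why such normalization helps when inference resolutions, and hence token sequence lengths, exceed those seen in training.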