Nature is infinitely resolution-free. In this context, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To address this limitation, we conceptualize images as sequences of tokens with dynamic sizes, in contrast to traditional methods that perceive images as fixed-resolution grids. This perspective enables a flexible training strategy that seamlessly accommodates various aspect ratios during both training and inference, thus promoting resolution generalization and eliminating biases introduced by image cropping. On this basis, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. We further upgrade FiT to FiTv2 with several innovative designs, including Query-Key vector normalization, an AdaLN-LoRA module, a rectified flow scheduler, and a Logit-Normal sampler. Enhanced by a meticulously adjusted network structure, FiTv2 achieves twice the convergence speed of FiT. When incorporating advanced training-free extrapolation techniques, FiTv2 demonstrates remarkable adaptability in both resolution extrapolation and diverse-resolution generation. Additionally, our exploration of the scalability of FiTv2 reveals that larger models exhibit better computational efficiency. Furthermore, we introduce an efficient post-training strategy to adapt a pre-trained model for high-resolution generation. Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions. We have released all code and models at https://github.com/whlzy/FiT to promote the exploration of diffusion transformer models for arbitrary-resolution image generation.
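Two of the components named above can be sketched in isolation to make the abstract concrete: a Logit-Normal timestep sampler (sample from a normal distribution, then squash through a sigmoid so timesteps concentrate at intermediate noise levels) and Query-Key vector normalization (L2-normalize queries and keys before the attention dot product, which bounds logit magnitudes at token counts unseen during training). This is a minimal illustrative sketch, not the released FiTv2 implementation; the function names, the unit scale factor, and all hyperparameters are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_normal_timesteps(n, mean=0.0, std=1.0):
    """Sample n diffusion timesteps t in (0, 1) from a logit-normal
    distribution: t = sigmoid(z) with z ~ N(mean, std^2), biasing
    training toward intermediate noise levels."""
    z = rng.normal(mean, std, size=n)
    return 1.0 / (1.0 + np.exp(-z))

def qk_norm_attention(q, k, v, eps=1e-6):
    """Single-head attention with Query-Key vector normalization:
    q and k are L2-normalized along the channel axis before the dot
    product, so attention logits stay bounded regardless of the
    (possibly extrapolated) sequence length."""
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = q @ k.swapaxes(-1, -2)  # in practice a learnable scale is common here
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v
```

With normalized queries and keys each logit is a cosine similarity in [-1, 1], which is one intuition for why such normalization helps when inference resolutions, and hence token sequence lengths, exceed those seen in training.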