The recent surge of interest in comprehensive multimodal models has necessitated the unification of diverse modalities. However, this unification suffers from disparate methodologies: continuous visual generation requires a full-sequence, diffusion-based approach, which diverges from the autoregressive modeling used in the text domain. We posit that autoregressive modeling, i.e., predicting the future based on past deterministic experience, remains crucial for developing both visual generation models and a potential unified multimodal model. In this paper, we explore an interpolation between autoregressive modeling and full-sequence diffusion for modeling visual information. At its core, we present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer, in which the block size of diffusion, i.e., the size of each autoregressive unit, can be flexibly adjusted to interpolate between token-wise autoregression and full-sequence diffusion. ACDiT is easy to implement: training requires nothing more than a Skip-Causal Attention Mask (SCAM). At inference time, the process iterates between diffusion denoising and autoregressive decoding, making full use of the KV-Cache. We verify the effectiveness of ACDiT on image and video generation tasks. We also demonstrate that, benefiting from autoregressive modeling, ACDiT can be used seamlessly for visual understanding tasks despite being trained with a diffusion objective. Our analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT for long-horizon visual generation tasks. These strengths make it a promising backbone for future unified models.
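To make the training recipe concrete, the following is a minimal sketch of what a Skip-Causal Attention Mask could look like. It assumes an illustrative sequence layout, not necessarily the paper's exact implementation: for each autoregressive block i, the sequence holds its clean (already-denoised) tokens followed by its noisy tokens. Clean tokens attend causally over clean blocks; noisy tokens of block i attend to the clean tokens of all earlier blocks plus, with full attention, the noisy tokens of block i itself. The function name and layout are assumptions for illustration.

```python
import numpy as np

def skip_causal_attention_mask(num_blocks: int, block_size: int) -> np.ndarray:
    """Illustrative SCAM over the assumed layout
    [clean_0, noisy_0, clean_1, noisy_1, ...].

    True means attention is allowed.
    """
    L = 2 * num_blocks * block_size
    mask = np.zeros((L, L), dtype=bool)

    def clean(i):  # positions of block i's clean tokens
        s = 2 * i * block_size
        return slice(s, s + block_size)

    def noisy(i):  # positions of block i's noisy tokens
        s = (2 * i + 1) * block_size
        return slice(s, s + block_size)

    for i in range(num_blocks):
        # Clean tokens: blockwise-causal attention over clean blocks <= i.
        for j in range(i + 1):
            mask[clean(i), clean(j)] = True
        # Noisy tokens: condition on clean tokens of strictly earlier blocks...
        for j in range(i):
            mask[noisy(i), clean(j)] = True
        # ...plus full (bidirectional) attention within their own noisy block.
        mask[noisy(i), noisy(i)] = True
    return mask
```

Because noisy tokens never attend to other noisy blocks, each denoised block's keys and values can be cached and reused across subsequent blocks at inference, which is what enables the KV-Cache reuse described above.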