The recent surge of interest in comprehensive multimodal models has necessitated the unification of diverse modalities. However, this unification suffers from disparate methodologies: continuous visual generation requires a full-sequence, diffusion-based approach, which diverges from the autoregressive modeling used in the text domain. We posit that autoregressive modeling, i.e., predicting the future based on past deterministic experience, remains crucial for developing both visual generation models and a potential unified multimodal model. In this paper, we explore an interpolation between autoregressive modeling and full-sequence diffusion for modeling visual information. At its core, we present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer, in which the block size of diffusion, i.e., the size of each autoregressive unit, can be flexibly adjusted to interpolate between token-wise autoregression and full-sequence diffusion. ACDiT is easy to implement: training requires nothing more than a Skip-Causal Attention Mask (SCAM). At inference time, the process iterates between diffusion denoising and autoregressive decoding, making full use of the KV-Cache. We verify the effectiveness of ACDiT on image and video generation tasks. We also demonstrate that, benefiting from autoregressive modeling, ACDiT can be used seamlessly for visual understanding tasks despite being trained with a diffusion objective. Our analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT for long-horizon visual generation tasks. These strengths make it a promising backbone for future unified models.
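To make the training recipe concrete, the following is a minimal sketch of what a Skip-Causal Attention Mask could look like. It assumes an illustrative sequence layout, not necessarily the paper's exact implementation: for each autoregressive block i, the sequence holds its clean (already-denoised) tokens followed by its noisy tokens. Clean tokens attend causally over clean blocks; noisy tokens of block i attend to the clean tokens of all earlier blocks plus, with full attention, the noisy tokens of block i itself. The function name and layout are assumptions for illustration.

```python
import numpy as np

def skip_causal_attention_mask(num_blocks: int, block_size: int) -> np.ndarray:
    """Illustrative SCAM over the assumed layout
    [clean_0, noisy_0, clean_1, noisy_1, ...].

    True means attention is allowed.
    """
    L = 2 * num_blocks * block_size
    mask = np.zeros((L, L), dtype=bool)

    def clean(i):  # positions of block i's clean tokens
        s = 2 * i * block_size
        return slice(s, s + block_size)

    def noisy(i):  # positions of block i's noisy tokens
        s = (2 * i + 1) * block_size
        return slice(s, s + block_size)

    for i in range(num_blocks):
        # Clean tokens: blockwise-causal attention over clean blocks <= i.
        for j in range(i + 1):
            mask[clean(i), clean(j)] = True
        # Noisy tokens: condition on clean tokens of strictly earlier blocks...
        for j in range(i):
            mask[noisy(i), clean(j)] = True
        # ...plus full (bidirectional) attention within their own noisy block.
        mask[noisy(i), noisy(i)] = True
    return mask
```

Because noisy tokens never attend to other noisy blocks, each denoised block's keys and values can be cached and reused across subsequent blocks at inference, which is what enables the KV-Cache reuse described above.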