We introduce Causal Diffusion as the autoregressive (AR) counterpart of diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models such as LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization to a diffusion model can substantially improve its performance and enable a smooth transition between the AR and diffusion generation modes. Hence, we propose CausalFusion, a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, leading to state-of-the-art results on the ImageNet generation benchmark while also enjoying the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase its ability to perform zero-shot in-context image manipulation. We hope this work provides the community with a fresh perspective on training multimodal models over discrete and continuous data.
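To make the dual factorization concrete, below is a minimal PyTorch sketch of a single training step, not the authors' implementation. It assumes continuous image tokens, a plain causally masked transformer encoder as a stand-in for the paper's decoder-only model, and standard DDPM-style epsilon-prediction noising; the names `CausalFusionToy` and `training_step` are hypothetical. The key idea it illustrates: earlier tokens condition the model along the sequential (AR) axis while the current chunk is denoised at a sampled noise level along the diffusion axis.

```python
# Minimal sketch of dual factorization (sequential tokens x noise levels).
# Assumptions: continuous tokens, epsilon-prediction DDPM noising, a plain
# causal mask. All names here are illustrative, not the paper's code.
import torch
import torch.nn as nn

class CausalFusionToy(nn.Module):
    def __init__(self, dim=64, n_heads=4, n_layers=2, n_steps=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.t_embed = nn.Embedding(n_steps, dim)   # diffusion-time conditioning
        self.head = nn.Linear(dim, dim)             # predicts the added noise
        # precomputed linear noise schedule
        betas = torch.linspace(1e-4, 2e-2, n_steps)
        alphas_bar = torch.cumprod(1.0 - betas, dim=0)
        self.register_buffer("sqrt_ab", alphas_bar.sqrt())
        self.register_buffer("sqrt_1mab", (1.0 - alphas_bar).sqrt())

    def forward(self, ctx, x_t, t):
        # ctx: clean tokens from earlier AR steps, shape (B, Lc, D)
        # x_t: noised tokens of the current AR step, shape (B, Ln, D)
        h = torch.cat([ctx, x_t + self.t_embed(t)[:, None, :]], dim=1)
        L = h.size(1)
        mask = torch.triu(  # causal mask: each position sees only its past
            torch.full((L, L), float("-inf"), device=h.device), diagonal=1)
        h = self.backbone(h, mask=mask)
        return self.head(h[:, ctx.size(1):])        # noise prediction for x_t

def training_step(model, tokens, split):
    # Dual factorization: condition on clean tokens before `split` (the
    # sequential axis), denoise the rest at a random level (the noise axis).
    ctx, x0 = tokens[:, :split], tokens[:, split:]
    t = torch.randint(0, 1000, (tokens.size(0),))
    eps = torch.randn_like(x0)
    x_t = model.sqrt_ab[t, None, None] * x0 + model.sqrt_1mab[t, None, None] * eps
    return nn.functional.mse_loss(model(ctx, x_t, t), eps)

model = CausalFusionToy()
tokens = torch.randn(2, 16, 64)   # (batch, sequence length, token dim)
loss = training_step(model, tokens, split=8)
loss.backward()
```

Varying `split` between 0 (pure diffusion over the whole sequence) and the full sequence length (pure next-token generation) is one way to read the "smooth transition between AR and diffusion generation modes" described above.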