Diffusion models have attracted significant attention in image generation owing to their exceptional performance. This success has recently been extended to text generation by generating all tokens in a sequence concurrently. However, natural language exhibits a far more pronounced sequential dependency than images do, and most existing language models are trained in a left-to-right auto-regressive manner. To account for this inherently sequential character of natural language, we introduce Auto-Regressive Diffusion (AR-Diffusion). AR-Diffusion ensures that the generation of tokens on the right depends on the already generated tokens on the left, a mechanism achieved by employing a dynamic number of denoising steps that varies with token position. As a result, tokens on the left undergo fewer denoising steps than those on the right, so they are generated earlier and subsequently influence the generation of tokens on the right. In a series of experiments on various text generation tasks, including text summarization, machine translation, and common sense generation, AR-Diffusion clearly outperforms existing diffusion language models, and it can be $100\times\sim600\times$ faster when achieving comparable results. Our code is available at https://github.com/microsoft/ProphetNet/tree/master/AR-diffusion.
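To make the position-dependent denoising idea concrete, the following is a minimal sketch of one possible schedule. It is not taken from the AR-Diffusion code: the function names `token_timesteps` and `steps_until_clean`, the linear offset `delta`, and the clipping rule are all assumptions chosen for illustration. Every token starts at the maximum noise level `max_t`, a single global counter `s` counts down, and tokens further to the left are shifted ahead so that they reach timestep 0 (fully denoised) after fewer global steps.

```python
def token_timesteps(s, seq_len, max_t, delta=1):
    """Per-token timesteps at global reverse step s.

    Hypothetical linear schedule (an assumption, not the authors' exact rule):
    token n is shifted ahead by (seq_len - 1 - n) * delta, then clipped to [0, max_t].
    """
    return [
        min(max(s - (seq_len - 1 - n) * delta, 0), max_t)
        for n in range(seq_len)
    ]


def steps_until_clean(seq_len, max_t, delta=1):
    """Number of global reverse steps after which each token first reaches timestep 0."""
    total = max_t + (seq_len - 1) * delta  # global step at which all tokens are pure noise
    finished = [None] * seq_len
    for step, s in enumerate(range(total, -1, -1)):
        for n, t in enumerate(token_timesteps(s, seq_len, max_t, delta)):
            if t == 0 and finished[n] is None:
                finished[n] = step
    return finished


if __name__ == "__main__":
    # With 4 tokens and max_t = 5, the leftmost token is clean first.
    print(steps_until_clean(seq_len=4, max_t=5))
```

Under these assumptions the leftmost token finishes first and each token to its right finishes one global step later, so tokens on the right are still noisy while the ones on the left are already fixed and can condition them.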