Diffusion models have attracted significant attention in image generation for their exceptional performance. This success has recently been extended to text generation by generating all tokens in a sequence concurrently. However, natural language exhibits a far more pronounced sequential dependency than images, and most existing language models are trained with a left-to-right auto-regressive approach. To account for the inherent sequential nature of natural language, we introduce Auto-Regressive Diffusion (AR-Diffusion). AR-Diffusion ensures that the generation of tokens on the right depends on the already generated tokens on the left, a mechanism achieved by using a dynamic number of denoising steps that varies with token position. As a result, tokens on the left undergo fewer denoising steps than those on the right, so they are generated earlier and can then influence the generation of tokens on the right. In experiments on text generation tasks including text summarization, machine translation, and common sense generation, AR-Diffusion clearly outperforms existing diffusion language models and can be $100\times$ to $600\times$ faster when achieving comparable results. Our code will be publicly released.
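To make the position-dependent denoising idea concrete, the following is a minimal sketch of one possible schedule. The abstract does not specify how timesteps are assigned, so the linear lag between neighbouring positions and the names `position_timesteps`, `schedule`, `steps_until_done`, `sweep`, and `lag` are all illustrative assumptions rather than the method's actual formulation.

```python
def position_timesteps(sweep, seq_len, max_t, lag):
    """Return one diffusion timestep (noise level) per token position.

    Assumption: each position trails its left neighbour by `lag` timesteps,
    clipped to the valid range [0, max_t]. 0 means fully denoised.
    """
    return [min(max_t, max(0, sweep + n * lag)) for n in range(seq_len)]


def schedule(seq_len, max_t, lag):
    """List the per-position timesteps for every step of the reverse process.

    The sweep value starts at max_t (all tokens are pure noise) and decreases
    until the rightmost token has also reached timestep 0.
    """
    last_sweep = -(seq_len - 1) * lag
    return [
        position_timesteps(s, seq_len, max_t, lag)
        for s in range(max_t, last_sweep - 1, -1)
    ]


def steps_until_done(seq_len, max_t, lag):
    """For each position, count the steps taken before it first reaches 0."""
    rows = schedule(seq_len, max_t, lag)
    return [
        next(i for i, row in enumerate(rows) if row[n] == 0)
        for n in range(seq_len)
    ]


if __name__ == "__main__":
    for row in schedule(seq_len=4, max_t=3, lag=2):
        print(row)
    print(steps_until_done(seq_len=4, max_t=3, lag=2))
```

In this sketch every row is non-decreasing from left to right, so a token on the left is never noisier than one to its right, and left tokens reach timestep 0 after fewer steps. A real sampler would call a denoising network at each row with these per-token timesteps; that part is omitted here.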