In this report, we explore the potential for text diffusion to replace autoregressive (AR) decoding in the training and deployment of large language models (LLMs). We are particularly interested in whether pretrained AR models can be transformed into text diffusion models through a lightweight adaptation procedure we call ``AR2Diff''. We begin by establishing a strong baseline setup for training text diffusion models. Comparing across multiple architectures and pretraining objectives, we find that training a decoder-only model with a prefix LM objective is best or near-best across several tasks. Building on this finding, we test various transfer learning setups for text diffusion models. On machine translation, text diffusion underperforms the standard AR approach. However, on code synthesis and extractive QA, diffusion models trained from scratch outperform AR models in many cases. We also observe quality gains from AR2Diff, i.e., from adapting pretrained AR models to use diffusion decoding. These results are promising given that text diffusion is relatively underexplored and can be significantly faster than AR decoding for long text generation.
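To make the decoding-speed claim concrete, the sketch below contrasts the two inference loops. It is a minimal illustration, not the procedure from this report: `next_token`, `denoise`, and `MASK` are hypothetical placeholders for a model's next-token and denoising calls. The key contrast is that AR decoding makes one sequential model call per output token, while diffusion decoding makes a fixed number of parallel-refinement calls regardless of output length.

```python
# Illustrative sketch only: `next_token` and `denoise` stand in for
# forward passes of a trained model; they are not from this report.

MASK = "<mask>"

def ar_decode(next_token, length):
    """Autoregressive decoding: one sequential model call per token,
    so generating `length` tokens costs `length` forward passes."""
    tokens = []
    for _ in range(length):
        tokens.append(next_token(tokens))
    return tokens

def diffusion_decode(denoise, length, num_steps):
    """Diffusion decoding: start from an all-masked sequence and refine
    every position in parallel; the sequential cost is `num_steps`
    forward passes, independent of sequence length."""
    tokens = [MASK] * length
    for step in reversed(range(num_steps)):
        tokens = denoise(tokens, step)
    return tokens

# Toy usage: with length=1024 and num_steps=16, diffusion decoding makes
# 16 sequential model calls where AR decoding would make 1024.
if __name__ == "__main__":
    out = diffusion_decode(lambda toks, step: ["tok"] * len(toks),
                           length=8, num_steps=4)
    print(out)
```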