Modern autoregressive Large Language Models (LLMs) have achieved outstanding performance on NLP benchmarks and are deployed in the real world. However, they still suffer from limitations of the autoregressive training paradigm. For example, autoregressive token generation is notably slow and can be prone to \textit{exposure bias}. Diffusion-based language models were proposed as an alternative to autoregressive generation to address some of these limitations. We evaluate the recently proposed Score Entropy Discrete Diffusion (SEDD) approach and show that it is a promising alternative to autoregressive generation, although it has some shortcomings. We empirically demonstrate the advantages and challenges of SEDD, and observe that SEDD generally matches autoregressive models in perplexity and on benchmarks such as HellaSwag, ARC, and WinoGrande. Additionally, we show that in terms of inference latency, SEDD can be up to 4.5$\times$ more efficient than GPT-2. While SEDD allows conditioning on tokens at arbitrary positions, it appears slightly weaker than GPT-2 for conditional generation given short prompts. Finally, we reproduce the main results from the original SEDD paper.