Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. (2018) to speed up language model inference: additional prediction heads propose a block of future tokens in parallel (a draft), and the base model then verifies the draft and accepts its longest matching prefix. In this paper, we make two contributions to understanding and improving BPD drafts. First, we analyze the token distributions produced by the BPD prediction heads. Second, we use this analysis to design algorithms that improve BPD inference speed by refining the drafts with small n-gram or neural language models. We show empirically that the refined drafts yield a higher average verified prefix length across tasks.
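To make the evaluation metric concrete, the following is a minimal sketch of how a verified prefix length could be computed for a single draft, assuming greedy verification: a draft token is accepted only if it equals the base model's top prediction given everything accepted so far. The names `verified_prefix_length`, `greedy_next`, and `toy_greedy_next` are hypothetical and used for illustration only. The toy model is a stand-in, not a language model, and the sequential loop is for readability (BPD performs this check in parallel in a single forward pass).

```python
from typing import Callable, List, Sequence


def verified_prefix_length(
    context: Sequence[int],
    draft: Sequence[int],
    greedy_next: Callable[[List[int]], int],
) -> int:
    """Count the leading draft tokens that match the base model's greedy choice.

    `greedy_next` is an assumed interface: it maps a token prefix to the
    base model's most likely next token.
    """
    accepted = 0
    prefix = list(context)
    for token in draft:
        # Stop at the first draft token the base model would not have produced.
        if greedy_next(prefix) != token:
            break
        prefix.append(token)
        accepted += 1
    return accepted


def toy_greedy_next(prefix: List[int]) -> int:
    # Hypothetical stand-in for a base model: always predicts "last token + 1".
    return prefix[-1] + 1


context = [1, 2, 3]
draft = [4, 5, 9, 7]  # the third draft token disagrees with the toy model
print(verified_prefix_length(context, draft, toy_greedy_next))  # 2
```

Averaging this count over decoding steps gives an average verified prefix length. Under this sketch, a refinement step that rewrites the draft before verification helps exactly when it makes more leading tokens agree with the base model.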