Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach that converts a pretrained autoregressive language model from a slow, single next-token prediction model into a fast, standalone multi-token prediction model using a simple online distillation objective. The final model retains exactly the same implementation as the pretrained initial checkpoint and is deployable without any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that decode more than $3\times$ faster on average, with a $<5\%$ drop in accuracy relative to single-token decoding performance.
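The abstract's core contrast — one forward pass per token versus several tokens per forward pass, with the student trained against the teacher's own rollouts — can be illustrated with a toy sketch. This is not the paper's implementation: the model, the `PAD` placeholder token, and all function names below are hypothetical stand-ins, and the "decoder" is a trivial causal averaging model rather than a transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, K = 16, 8, 4  # vocabulary size, hidden size, tokens per step
PAD = 0  # hypothetical placeholder token fed at the K positions to be filled

class ToyLM:
    """Stand-in for a decoder: returns next-token logits at every position.
    A causal running mean over the prefix plays the role of attention."""
    def __init__(self):
        self.emb = rng.normal(size=(VOCAB, DIM))
        self.pos = rng.normal(size=(64, DIM))
        self.out = rng.normal(size=(DIM, VOCAB))

    def logits(self, seq):
        x = self.emb[seq] + self.pos[: len(seq)]
        h = np.cumsum(x, axis=0) / np.arange(1, len(seq) + 1)[:, None]
        return h @ self.out  # shape (len(seq), VOCAB)

def decode_single(model, ctx, k):
    """Baseline slow path: k forward passes, one new token per pass."""
    seq = list(ctx)
    for _ in range(k):
        seq.append(int(np.argmax(model.logits(seq)[-1])))
    return seq[len(ctx):]

def decode_multi(model, ctx, k):
    """Fast student path: ONE forward pass over ctx plus k placeholders;
    logits at the placeholder slots are read out as k tokens at once.
    Note the model code is unchanged -- only the decoding loop differs."""
    seq = list(ctx) + [PAD] * k
    lg = model.logits(seq)
    return [int(np.argmax(lg[len(ctx) - 1 + i])) for i in range(k)]

def distill_loss(student, teacher, ctx, k):
    """Online-distillation objective (sketch): cross-entropy of the
    student's k parallel predictions against the frozen teacher's own
    k-step autoregressive rollout on the same context."""
    targets = decode_single(teacher, ctx, k)
    lg = student.logits(list(ctx) + [PAD] * k)
    loss = 0.0
    for i, t in enumerate(targets):
        z = lg[len(ctx) - 1 + i]
        z = z - z.max()  # stable log-softmax
        loss -= z[t] - np.log(np.exp(z).sum())
    return loss / k

teacher, student = ToyLM(), ToyLM()
ctx = [1, 2, 3]
print(decode_multi(student, ctx, K))        # k tokens from one forward pass
print(distill_loss(student, teacher, ctx, K))  # scalar training objective
```

The sketch makes the deployment claim concrete: `decode_multi` calls the same `logits` function as the baseline, so once training drives `distill_loss` down, the checkpoint can be served with ordinary inference code and no verifier.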