This technical report describes the design and training of novel speculative decoding draft models for accelerating the inference speed of large language models in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating the wall-clock inference speed of highly optimized base model implementations by 2-3x. We explore these initial results and describe next steps for further improvements.
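The accept/reject loop described above can be sketched in miniature. This is a simplified, greedy-verification illustration, not the report's actual implementation: the draft model proposes `k` tokens, and the base model accepts the longest prefix it agrees with, emitting its own token at the first disagreement (or a bonus token if all drafts match). The function names, the per-position base-model calls, and the toy token representation are all illustrative assumptions; in a real system the base model scores all `k` draft positions in a single forward pass.

```python
def speculative_decode_step(draft_next, base_next, context, k=4):
    """One speculative decoding step (greedy-verification sketch).

    draft_next(context) -> next token from the cheap draft model
    base_next(context)  -> next token from the expensive base model
    Note: a production implementation verifies all k draft positions
    in ONE base-model forward pass; we call base_next per position
    only for clarity. Names and signatures here are hypothetical.
    """
    # Draft model proposes k tokens autoregressively.
    proposed = []
    ctx = list(context)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # Base model verifies: accept the longest agreeing prefix.
    accepted = []
    ctx = list(context)
    for token in proposed:
        base_token = base_next(ctx)
        if base_token == token:
            accepted.append(token)
            ctx.append(token)
        else:
            # First disagreement: emit the base model's correction
            # and stop; everything after it is discarded.
            accepted.append(base_token)
            return accepted

    # All k drafts accepted: the base model contributes a bonus token,
    # so up to k+1 tokens are emitted per verification pass.
    accepted.append(base_next(ctx))
    return accepted
```

Because every emitted token is either confirmed or produced by the base model, the output distribution matches ordinary decoding under greedy verification; the speedup comes from emitting several tokens per base-model pass when the draft is accurate.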