Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in multilingual settings. To mitigate this challenge, this paper explores a training recipe for an assistant model in speculative decoding, which drafts future tokens that are then verified by the target LLM. We show that language-specific draft models, optimized through a targeted pretrain-and-finetune strategy, substantially speed up inference compared to previous methods. We validate these models across various languages on inference time, out-of-domain speedup, and GPT-4o evaluation.
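The draft-then-verify loop described above can be sketched as follows. This is a minimal, self-contained illustration of greedy speculative decoding, not the paper's actual training recipe or implementation: the `target` and `draft` callables are hypothetical toy stand-ins that each map a context to their single most likely next token, and `k` is an assumed draft length.

```python
def speculative_decode(target, draft, context, num_tokens, k=4):
    """Generate num_tokens tokens: the draft model proposes up to k
    tokens per step, and the target model verifies them in order."""
    out = list(context)
    while len(out) - len(context) < num_tokens:
        # 1) Draft phase: the small model proposes k tokens greedily.
        proposed = []
        ctx = list(out)
        for _ in range(k):
            t = draft(tuple(ctx))
            proposed.append(t)
            ctx.append(t)
        # 2) Verify phase: accept each proposal while it matches the
        #    target's own greedy choice; on the first mismatch, emit
        #    the target's token instead and restart drafting.
        for t in proposed:
            expected = target(tuple(out))
            if t == expected:
                out.append(t)           # accepted draft token
            else:
                out.append(expected)    # target's correction
                break
            if len(out) - len(context) >= num_tokens:
                break
    return out[len(context):][:num_tokens]


# Toy stand-in models over a 5-token vocabulary (assumed for the demo):
# the target always continues the cycle 0,1,2,3,4,0,...
target_model = lambda ctx: (ctx[-1] + 1) % 5
good_draft = lambda ctx: (ctx[-1] + 1) % 5   # agrees with target
bad_draft = lambda ctx: (ctx[-1] + 2) % 5    # always disagrees
```

When the draft model agrees with the target, several tokens are accepted per target call; when it disagrees, the loop falls back to one verified token per step, so the output is identical either way and only the speed differs.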