Large language models (LLMs) have shown extraordinary performance on a wide range of language tasks, but their high computational requirements hinder widespread deployment. Speculative decoding, which uses amateur models to predict the generation of expert models, has been proposed as a way to accelerate LLM inference. However, speculative decoding focuses on acceleration and does not make full use of the token distribution from amateur models. We propose Speculative Contrastive Decoding (SCD), an accelerated decoding method that leverages the natural contrast between expert and amateur models in speculative decoding. Comprehensive evaluations on four benchmarks show that SCD achieves acceleration factors similar to those of speculative decoding while also improving generation quality, as contrastive decoding does. An analysis of token probabilities further demonstrates the compatibility between speculative and contrastive decoding. Overall, SCD provides an effective approach to enhancing the decoding quality of LLMs while saving computational resources.
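As a rough illustration of the "contrast between expert and amateur models" mentioned above, the following minimal sketch scores tokens with the commonly used contrastive decoding objective: the expert log-probability minus the amateur log-probability, restricted to tokens the expert finds plausible. It is not taken from this paper and may differ from the exact SCD formulation. The function names, the `alpha` threshold value, and the toy distributions are all assumptions made for illustration, and the speculative accept/reject step is not shown.

```python
import math

def contrastive_scores(expert_probs, amateur_probs, alpha=0.1):
    """Score each token as log p_expert - log p_amateur.

    Tokens whose expert probability falls below alpha * max(expert_probs)
    are treated as implausible and receive a score of -inf.
    Assumes all probabilities are strictly positive (hypothetical toy setup).
    """
    threshold = alpha * max(expert_probs)
    scores = []
    for p_e, p_a in zip(expert_probs, amateur_probs):
        if p_e >= threshold:
            scores.append(math.log(p_e) - math.log(p_a))
        else:
            scores.append(float("-inf"))
    return scores

def pick_token(expert_probs, amateur_probs, alpha=0.1):
    """Return the index of the token with the highest contrastive score."""
    scores = contrastive_scores(expert_probs, amateur_probs, alpha)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
expert = [0.50, 0.30, 0.18, 0.02]
amateur = [0.70, 0.10, 0.15, 0.05]

scores = contrastive_scores(expert, amateur)
chosen = pick_token(expert, amateur)
```

In this toy case the expert's most likely token (index 0) is not selected, because the amateur favors it even more strongly. The token the expert prefers relative to the amateur (index 1) scores highest, and the last token is excluded by the plausibility threshold.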