Speculative decoding is a promising method for reducing the inference latency of large language models. The effectiveness of the method depends on the speculation length (SL): the number of tokens generated by the draft model at each iteration. The vast majority of speculative decoding approaches use the same SL for all iterations. In this work, we show that this practice is suboptimal. We introduce DISCO, a DynamIc SpeCulation length Optimization method that uses a classifier to dynamically adjust the SL at each iteration, while provably preserving the decoding quality. Experiments on four benchmarks demonstrate average speedup gains of 10.3% relative to our best baselines.
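To make the draft-then-verify loop and the role of the SL concrete, here is a minimal toy sketch of speculative decoding with a per-iteration speculation length. The "models" are deterministic next-token functions over integers, and `choose_sl` is a simple accept-rate heuristic standing in for DISCO's learned classifier (all names and the policy are illustrative assumptions, not the paper's implementation). The key property it preserves is that the output is token-for-token identical to the target model's own greedy decoding.

```python
def target_next(ctx):
    # Toy "target model": next token is the sum of the last two tokens mod 97.
    return (ctx[-1] + ctx[-2]) % 97

def draft_next(ctx):
    # Toy "draft model": agrees with the target except when the last
    # token is divisible by 7, where it is off by one.
    t = target_next(ctx)
    return t if ctx[-1] % 7 else (t + 1) % 97

def choose_sl(sl, n_acc):
    # Stand-in policy (assumption): grow the SL after a full acceptance,
    # shrink it toward the accepted length after a rejection.
    # DISCO instead predicts the SL with a trained classifier.
    return sl + 1 if n_acc == sl else max(1, n_acc)

def greedy_generate(prompt, n_new):
    # Reference: plain greedy decoding with the target model.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_generate(prompt, n_new, sl_init=4):
    seq = list(prompt)
    sl = sl_init
    while len(seq) < len(prompt) + n_new:
        # 1) The draft model proposes `sl` tokens autoregressively.
        draft, ctx = [], seq[:]
        for _ in range(sl):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) The target verifies: accept the longest prefix it agrees with.
        n_acc, ctx = 0, seq[:]
        for t in draft:
            if target_next(ctx) == t:
                ctx.append(t)
                n_acc += 1
            else:
                break
        # 3) Append the accepted prefix plus one token from the target,
        #    so the output exactly matches greedy target decoding.
        seq = seq + draft[:n_acc] + [target_next(seq + draft[:n_acc])]
        # 4) Dynamically adjust the SL for the next iteration.
        sl = choose_sl(sl, n_acc)
    return seq[:len(prompt) + n_new]
```

With a well-aligned draft model, most iterations accept several tokens per target verification, which is where the latency reduction comes from; choosing the SL per iteration avoids wasting draft work when a rejection is likely.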