Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both interleave models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only on "hard" inputs, while speculative decoding uses speculative execution, invoking the larger model primarily in a parallel verification mode. These mechanisms offer different benefits: empirically, cascades often yield better quality than even the larger model alone, while speculative decoding offers a theoretical guarantee of quality-neutrality. In this paper, we combine the best of both approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades and employ a plug-in approximation to it. Through experiments with T5 models on benchmark language tasks, we show that the proposed approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
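The interplay the abstract describes can be illustrated with a minimal sketch. Everything here is a hypothetical placeholder, not the paper's actual models or its optimal deferral rule: the toy distributions stand in for a small draft model and a large target model, and the threshold-based `defer` function stands in for a plug-in deferral rule. The small model proposes a token, the large model scores the same position (in practice, in parallel during verification), and the deferral rule decides whether to keep the draft token or defer to the large model.

```python
# Toy sketch of a speculative cascade step (hypothetical models and rule).

def small_model(prefix):
    # Placeholder draft-model distribution over a tiny vocabulary.
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def large_model(prefix):
    # Placeholder target-model distribution over the same vocabulary.
    return {"a": 0.3, "b": 0.5, "c": 0.2}

def defer(p_small, p_large, threshold=0.5):
    # Plug-in-style deferral rule (illustrative only): defer to the
    # large model when the draft model's confidence in its own top
    # token falls below a threshold. p_large is passed in because a
    # real rule may compare the two distributions directly.
    return max(p_small.values()) < threshold

def speculative_cascade_step(prefix):
    p_small = small_model(prefix)
    p_large = large_model(prefix)  # scored alongside the draft in verification
    if defer(p_small, p_large):
        # "Hard" position: use the large model's token.
        return max(p_large, key=p_large.get)
    # "Easy" position: accept the draft token without rejection.
    return max(p_small, key=p_small.get)
```

Note the contrast with plain speculative decoding: there, the large model's verification accepts or rejects draft tokens to exactly match its own distribution, whereas here the deferral rule may keep a confident draft token even when the two models disagree, which is what lets a cascade trade strict quality-neutrality for potentially better cost-quality trade-offs.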