Speculative decoding has emerged as a popular technique for accelerating inference in large language models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achieve substantial speedups typically rely on a separately trained draft model or auxiliary model components, increasing deployment and maintenance complexity. This added complexity reduces flexibility, particularly when serving workloads shift to tasks, domains, or languages that are not well represented in the draft model's training data. We introduce Simply-Scalable Speculative Decoding (SSSD), a training-free method that combines lightweight n-gram matching with hardware-aware speculation. Relative to standard autoregressive decoding, SSSD reduces latency by up to 2.9x. It matches the performance of leading training-based approaches across a broad range of benchmarks while requiring substantially lower adoption effort (no data preparation, training, or tuning is needed) and exhibiting superior robustness under language and domain shift, as well as in long-context settings.
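For intuition, here is a minimal sketch of the n-gram matching idea, in the spirit of prompt-lookup decoding: the last n context tokens are matched against earlier occurrences in the same sequence, and the tokens that followed the match are proposed as a draft for the target model to verify. The function name `ngram_draft` and the parameters `n` and `max_draft` are illustrative placeholders, not SSSD's actual API; in particular, SSSD's hardware-aware component (choosing how much to speculate based on hardware characteristics) is not modeled here.

```python
from typing import List


def ngram_draft(tokens: List[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by finding the most recent earlier occurrence of
    the context's last n tokens and speculating that its continuation repeats.
    Training-free: no draft model, no tuning, only the sequence itself."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan backwards so the most recent prior match wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n : start + n + max_draft]
    return []


# The target model then verifies the draft in one batched forward pass,
# accepting the longest prefix that agrees with its own predictions.
ctx = [5, 7, 9, 2, 4, 5, 7, 9]
print(ngram_draft(ctx))  # -> [2, 4, 5, 7, 9]: continuation seen after [5, 7, 9]
```

Scanning backwards is one plausible design choice (recent repetitions are often the most predictive); because drafting is a pure lookup with no neural draft model, a rejected draft costs little beyond the single verification pass.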