Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI operates on frozen LLMs, requires no training or architectural modifications, and preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI, but they rely on a drafter LLM that is both fast and accurate. In practice, off-the-shelf LLMs often lack matching drafters that are sufficiently fast and accurate. We expose a gap: SI becomes slower than non-SI when the drafter is too slow or too inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI for any drafters. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.
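To make the "preserves the target distribution" claim concrete, the sketch below simulates the standard SI accept/reject rule on a toy vocabulary: a cheap drafter distribution Q proposes a token, and the target distribution P either accepts it or resamples from the residual. The vocabulary, distributions, and function names here are illustrative assumptions, not the paper's implementation; the point is only that the accepted tokens are distributed exactly according to P.

```python
import random
from collections import Counter

# Toy vocabulary and two next-token distributions (hypothetical values):
# P is the target model's distribution, Q is a cheaper, less accurate drafter.
VOCAB = ["a", "b", "c"]
P = {"a": 0.6, "b": 0.3, "c": 0.1}   # target
Q = {"a": 0.3, "b": 0.5, "c": 0.2}   # drafter

def speculative_token(rng):
    """One step of the SI accept/reject rule: the result is distributed
    exactly according to P, even though the proposal comes from Q."""
    x = rng.choices(VOCAB, weights=[Q[t] for t in VOCAB])[0]
    if rng.random() < min(1.0, P[x] / Q[x]):
        return x  # draft token accepted by the target
    # Rejected: resample from the normalized residual max(0, P - Q).
    residual = {t: max(0.0, P[t] - Q[t]) for t in VOCAB}
    z = sum(residual.values())
    return rng.choices(VOCAB, weights=[residual[t] / z for t in VOCAB])[0]

rng = random.Random(0)
n = 100_000
counts = Counter(speculative_token(rng) for _ in range(n))
freqs = {t: counts[t] / n for t in VOCAB}
```

Running this, the empirical frequencies in `freqs` converge to P rather than Q, which is the losslessness property both SI and DSI share; DSI's contribution is in *when* drafting and verification run (in parallel, across multiple instances), not in this acceptance rule.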