Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI, but they require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate. We identify a gap: SI becomes slower than non-SI when the drafter is too slow or too inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI for any drafter. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.
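
To make the gap concrete, the following is a minimal sketch of a standard back-of-the-envelope latency model for SI. It is not the simulation used in this paper. It assumes that each drafted token is accepted independently with a fixed probability, that the drafter proposes a fixed number of tokens per iteration, and that one target forward pass verifies all of them. The function names and the parameters `acceptance_rate` and `lookahead` are illustrative assumptions, not names taken from this work.

```python
def expected_tokens_per_iteration(acceptance_rate: float, lookahead: int) -> float:
    """Expected tokens produced by one draft-then-verify iteration.

    Assumption: each drafted token is accepted independently with
    probability `acceptance_rate`; the target always contributes one token.
    """
    if acceptance_rate >= 1.0:
        return float(lookahead + 1)
    # Geometric sum: 1 + a + a^2 + ... + a^lookahead
    return (1.0 - acceptance_rate ** (lookahead + 1)) / (1.0 - acceptance_rate)


def si_latency_per_token(
    target_latency: float,
    drafter_latency: float,
    acceptance_rate: float,
    lookahead: int,
) -> float:
    """Average SI latency per generated token under the model above."""
    # One iteration = `lookahead` sequential drafter passes + one target pass.
    iteration_latency = lookahead * drafter_latency + target_latency
    return iteration_latency / expected_tokens_per_iteration(acceptance_rate, lookahead)


def si_speedup_over_non_si(
    target_latency: float,
    drafter_latency: float,
    acceptance_rate: float,
    lookahead: int,
) -> float:
    """Ratio of non-SI latency to SI latency; values below 1 mean SI is slower."""
    # Non-SI costs exactly one target forward pass per token.
    return target_latency / si_latency_per_token(
        target_latency, drafter_latency, acceptance_rate, lookahead
    )


if __name__ == "__main__":
    # Hypothetical settings, with latencies in units of one target forward pass.
    fast_accurate = si_speedup_over_non_si(1.0, 0.1, 0.8, 5)
    slow_inaccurate = si_speedup_over_non_si(1.0, 0.5, 0.2, 5)
    print(f"fast, accurate drafter:   {fast_accurate:.2f}x")
    print(f"slow, inaccurate drafter: {slow_inaccurate:.2f}x")
```

Under these assumptions, the first hypothetical setting yields a speedup above 1, while the second yields a value below 1, which illustrates the regime in which SI is slower than non-SI.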