Speculative decoding (SD) is a widely used approach for accelerating decode-heavy LLM inference workloads. Although online inference workloads are highly dynamic, existing SD systems are rigid and manage SD at a coarse granularity: they typically set a single speculative token length for an entire batch and serialize the execution of the draft and verification phases. Consequently, these systems adapt poorly to volatile online inference traffic. Under low load, they exhibit prolonged latency because the draft phase blocks the verification phase for the entire batch, leaving GPU compute resources underutilized. Conversely, under high load, they waste computation on rejected tokens during the verification phase, overloading GPU resources. We introduce FASER, a novel system that features fine-grained SD phase management. First, FASER minimizes computational waste by dynamically adjusting the speculative length for each request within a continuous batch and by pruning rejected tokens early inside the verification phase. Second, FASER breaks the verification phase into frontiers, or chunks, and overlaps them with the draft phase. This overlap is achieved via fine-grained spatial multiplexing with minimal resource interference. Our FASER prototype in vLLM improves throughput by up to 53% and reduces latency by up to 1.92$\times$ compared to state-of-the-art systems.
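To make the idea of a per-request speculative length concrete, the following is a minimal sketch and not FASER's actual policy, which the abstract does not specify. It assumes each request carries a running estimate of its per-token draft acceptance rate and that draft tokens are accepted independently, so the $(k+1)$-th draft token of a request with acceptance rate $\alpha$ contributes $\alpha^{k+1}$ expected accepted tokens. The names `Request`, `expected_tokens`, `assign_speculative_lengths`, `token_budget`, `max_len`, and `min_gain` are hypothetical and introduced only for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    acceptance_rate: float  # assumed running estimate, in [0, 1)


def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with k draft tokens,
    assuming independent acceptance with probability alpha:
    sum of alpha**i for i = 0..k."""
    return sum(alpha ** i for i in range(k + 1))


def assign_speculative_lengths(requests, token_budget, max_len=8, min_gain=0.05):
    """Greedy sketch (an assumption, not the paper's algorithm).

    Hands out draft-token slots one at a time to the request whose next
    draft token has the highest expected gain, until the batch-wide budget
    is spent or no request's next token is worth at least min_gain.
    """
    lengths = {r.req_id: 0 for r in requests}
    # Max-heap (via negated keys) on the marginal gain of the next token.
    heap = [(-r.acceptance_rate, r.req_id, r.acceptance_rate) for r in requests]
    heapq.heapify(heap)
    remaining = token_budget
    while heap and remaining > 0:
        neg_gain, req_id, alpha = heapq.heappop(heap)
        gain = -neg_gain
        if gain < min_gain:
            break  # every other request has an even lower marginal gain
        lengths[req_id] += 1
        remaining -= 1
        if lengths[req_id] < max_len:
            # The next token for this request is worth alpha times less.
            heapq.heappush(heap, (-(gain * alpha), req_id, alpha))
    return lengths


batch = [Request("a", 0.9), Request("b", 0.5), Request("c", 0.1)]
low_load = assign_speculative_lengths(batch, token_budget=100)
high_load = assign_speculative_lengths(batch, token_budget=10)
print(low_load, high_load)
```

In this sketch a small `token_budget` stands in for high load: requests with low acceptance rates receive few or no draft tokens, so less verification work is spent on tokens that are likely to be rejected. A large budget stands in for low load, where well-predicted requests can speculate up to `max_len`.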