Speculative inference is a promising paradigm that employs small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language model (LLM). This approach improves inference serving efficiency by reducing LLM latency and cost while preserving generation quality. However, existing speculative methods face critical challenges, including inefficient resource utilization and limited draft acceptance rates, which constrain their scalability and overall effectiveness. To overcome these obstacles, we present CoSine, a novel speculative inference system that decouples sequential speculative decoding from parallel verification, enabling efficient collaboration among multiple nodes. Specifically, CoSine routes inference requests to specialized drafters based on their expertise and incorporates a confidence-based token fusion mechanism to synthesize outputs from cooperating drafters, ensuring high-quality draft generation. Additionally, CoSine dynamically orchestrates speculative decoding and verification in a pipelined manner, employing batch scheduling to selectively group requests and adaptive speculation control to minimize idle periods. By optimizing parallel workflows through heterogeneous node collaboration, CoSine balances draft generation and verification throughput in real time, thereby maximizing resource utilization. Experimental results demonstrate that CoSine outperforms state-of-the-art speculative approaches. Notably, at equivalent resource cost, CoSine reduces latency by up to 23.2% and increases throughput by up to 32.5% over baseline methods.
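To make the confidence-based token fusion idea concrete, here is a minimal sketch: several drafters each propose a draft token per position along with a confidence score, and the fused draft keeps the most confident proposal at each position. All names here (`Draft`, `fuse_drafts`) are illustrative assumptions, not CoSine's actual API.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    tokens: list[str]          # candidate tokens, one per draft position
    confidences: list[float]   # drafter's confidence for each token

def fuse_drafts(drafts: list[Draft]) -> list[str]:
    """Select, per position, the token proposed with the highest confidence."""
    # Fuse only up to the shortest draft so every position has a proposal
    # from each drafter.
    length = min(len(d.tokens) for d in drafts)
    fused = []
    for i in range(length):
        best = max(drafts, key=lambda d: d.confidences[i])
        fused.append(best.tokens[i])
    return fused

# Example: two drafters disagree at position 1; the more confident one wins.
a = Draft(tokens=["the", "cat", "sat"], confidences=[0.9, 0.4, 0.8])
b = Draft(tokens=["the", "dog", "sat"], confidences=[0.9, 0.7, 0.6])
print(fuse_drafts([a, b]))  # → ['the', 'dog', 'sat']
```

In a full system the fused draft would then be verified in one parallel pass by the target LLM, with accepted prefixes committed and the first rejected position resampled.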