This paper introduces AdaServe, the first LLM serving system to support SLO customization through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to predict the speculative accuracy of tokens and employs a theoretically optimal algorithm to construct token trees for verification. To accommodate diverse SLO requirements without compromising throughput, AdaServe employs a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints while optimizing throughput. Comprehensive evaluations demonstrate that AdaServe achieves up to 73% higher SLO attainment and 74% higher goodput compared to state-of-the-art systems. These results underscore AdaServe's potential to enhance the efficiency and adaptability of LLM deployments across varied application scenarios.
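The speculation-and-selection scheme described above can be illustrated with a minimal sketch. It is not AdaServe's actual algorithm, only one plausible reading of the abstract, under these assumptions: a draft token's probability is used directly as a proxy for its acceptance probability, so a node's path probability is the product of the probabilities along its path; each request's SLO is reduced to a hypothetical per-step target `slo_need`, the expected number of accepted tokens it requires; and `budget` is the total number of tokens the verifier can process in one step. The names `Node`, `select_tokens`, `slo_need`, and `budget` are all illustrative and do not come from the paper.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Node:
    token: str
    prob: float  # draft probability given the parent (assumed acceptance proxy)
    children: list = field(default_factory=list)


def select_tokens(trees, slo_need, budget):
    """Pick at most `budget` tree nodes across all requests.

    trees:    request id -> list of root candidate Nodes (the speculation output)
    slo_need: request id -> expected accepted tokens needed this step (hypothetical)
    Returns (selected paths per request, expected accepted tokens per request).
    """
    tie = count()  # unique tie-breaker so the heap never compares Node objects
    selected = {rid: [] for rid in trees}
    expected = {rid: 0.0 for rid in trees}

    # One frontier heap per request, keyed by negated path probability.
    frontiers = {}
    for rid, roots in trees.items():
        heap = [(-n.prob, next(tie), rid, (n.token,), n) for n in roots]
        heapq.heapify(heap)
        frontiers[rid] = heap

    def take(heap):
        # Select the most probable frontier node, then expose its children.
        # Children enter the heap only after their parent is selected, so the
        # chosen nodes always form a connected subtree.
        neg_p, _, rid, path, node = heapq.heappop(heap)
        selected[rid].append(path)
        expected[rid] += -neg_p
        for child in node.children:
            item = (neg_p * child.prob, next(tie), rid, path + (child.token,), child)
            heapq.heappush(heap, item)

    # Phase 1: give each request enough tokens to reach its SLO target.
    for rid, heap in frontiers.items():
        while heap and budget > 0 and expected[rid] < slo_need.get(rid, 0.0):
            take(heap)
            budget -= 1

    # Phase 2: spend the remaining budget on the globally most probable nodes,
    # which maximizes expected accepted tokens (a stand-in for throughput).
    pool = [item for heap in frontiers.values() for item in heap]
    heapq.heapify(pool)
    while pool and budget > 0:
        take(pool)
        budget -= 1

    return selected, expected
```

In this sketch a latency-sensitive request with a high `slo_need` is served first, and a request with `slo_need` of zero receives tokens only when its candidates are the most probable ones left in the shared budget.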