The complexity of large language model (LLM) serving workloads has increased substantially due to their integration with external tool invocations, such as ChatGPT plugins. In this paper, we identify a new opportunity for efficiently serving requests that trigger tools: partial tool execution alongside LLM decoding. To this end, we design Conveyor, an efficient LLM serving system optimized for handling requests involving external tools. We introduce a novel interface that lets tool developers expose partial execution opportunities to the LLM serving system, and a request scheduler that facilitates partial tool execution. Our results demonstrate that partial tool execution can reduce request completion latency by up to 38.8%.
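The core idea of partial tool execution can be illustrated with a minimal sketch. The snippet below is a hypothetical simplification, not Conveyor's actual implementation: a simulated decoder streams tokens of a tool invocation, and each completed line is handed to a worker thread immediately, so the tool begins running before decoding finishes rather than after the full output is generated. All names (`decode_tokens`, `serve_request`, the token sequence) are illustrative assumptions.

```python
import queue
import threading
import time


def decode_tokens():
    # Hypothetical stand-in for an LLM decoder emitting a tool call
    # token by token; a real system streams these from the model.
    for tok in ["pip ", "install ", "requests\n", "import ", "requests\n"]:
        time.sleep(0.01)  # stand-in for per-token decoding latency
        yield tok


def serve_request():
    """Overlap tool execution with decoding: each completed line is
    dispatched to a worker immediately instead of waiting for the
    whole generation to finish."""
    lines = queue.Queue()
    executed = []

    def tool_worker():
        # Stand-in for the external tool consuming partial input.
        while True:
            line = lines.get()
            if line is None:  # sentinel: decoding is done
                break
            executed.append(line)  # here a real tool would run the line

    worker = threading.Thread(target=tool_worker)
    worker.start()

    buf = ""
    for tok in decode_tokens():
        buf += tok
        # A full line is the unit of partial execution in this sketch;
        # the tool starts on it while later tokens are still decoding.
        while "\n" in buf:
            line, buf = buf.split("\n", 1)
            lines.put(line)
    lines.put(None)
    worker.join()
    return executed
```

The latency benefit comes from overlap: in a conventional pipeline the tool's runtime adds serially to decoding time, whereas here the two proceed concurrently once the first executable unit is available.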