Large language models (LLMs) are increasingly integrated with external tools and APIs, such as ChatGPT plugins, to extend their capabilities beyond language-centric tasks. However, today's LLM inference systems are designed for standalone LLMs. They treat API calls as new requests, which causes unnecessary recomputation of already computed contexts; this recomputation accounts for 37-40% of total model forwarding time. This paper presents APIServe, the first LLM inference framework targeting API-augmented LLMs. APIServe minimizes the GPU resource waste caused by API calls and dedicates the saved memory to serving more requests. APIServe improves overall serving throughput by 1.6x and completes 2x more requests per second compared to state-of-the-art LLM inference systems.
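To make the recomputation cost concrete, the following is a minimal sketch that counts how many tokens pass through the model for a single request that pauses for API calls. It is an illustrative toy model, not APIServe's algorithm: the function name `count_forwarded`, its parameters, and the example token counts are all hypothetical, and token counts are only a rough proxy for forwarding time.

```python
def count_forwarded(prompt_len, steps, preserve_context):
    """Count tokens forwarded through the model for one request.

    prompt_len: number of prompt tokens (hypothetical input).
    steps: list of (generated_tokens, api_response_tokens) pairs; each
        step generates some tokens and then pauses for an API call.
    preserve_context: if True, the computed context is kept across the
        API call; if False, the call's return is treated as a new
        request and the whole context is forwarded again.

    Returns (total_forwarded_tokens, recomputed_tokens).
    """
    context = prompt_len      # tokens currently in the request's context
    forwarded = prompt_len    # the initial prefill forwards the prompt once
    recomputed = 0

    for generated, api_response in steps:
        # Each generated token is forwarded once during decoding.
        forwarded += generated
        context += generated

        if preserve_context:
            # Only the new API response tokens need a forward pass.
            forwarded += api_response
        else:
            # Treated as a new request: the existing context is
            # forwarded again together with the API response.
            forwarded += context + api_response
            recomputed += context

        context += api_response

    return forwarded, recomputed


# Hypothetical request: 100-token prompt, two API calls.
steps = [(20, 10), (30, 5)]
kept = count_forwarded(100, steps, preserve_context=True)
resubmitted = count_forwarded(100, steps, preserve_context=False)
print(kept, resubmitted)
```

In this toy setting, every token counted as recomputed had already been processed before the API call, which is the kind of waste the abstract describes. The numbers here are made up for illustration and are not meant to reproduce the 37-40% figure.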