Large language models (LLMs) are increasingly integrated with external environments, tools, and agents, such as ChatGPT plugins, to extend their capabilities beyond language-centric tasks. However, today's LLM inference systems are designed for standalone LLMs. They treat each external interaction as the end of LLM generation and form a new request when the interaction finishes, causing unnecessary recomputation of already computed contexts, which accounts for 37-40% of total model forwarding time. This paper presents InferCept, the first LLM inference framework targeting augmented LLMs and supporting the efficient interception of LLM generation. InferCept minimizes the GPU resource waste caused by LLM interceptions and dedicates the saved memory to serving more requests. InferCept improves overall serving throughput by 1.6x-2x and completes 2x more requests per second compared to state-of-the-art LLM inference systems.
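The recomputation overhead described in the abstract can be illustrated with a toy token-counting model. This is only a sketch under stated assumptions, not InferCept's algorithm: the function name `forwarded_tokens`, its parameters, and the example sizes are hypothetical, and counting token positions is a rough stand-in for forwarding time, so the numbers do not reproduce the 37-40% measurement.

```python
def forwarded_tokens(prompt_len, segments, recompute):
    """Count token positions passed through the model for one request.

    Hypothetical toy model (an assumption, not taken from the paper).
    segments: list of (generated_tokens, returned_tokens) pairs, one per
    interception; returned_tokens is what the external tool appends.
    recompute=True models a baseline in which every interception ends the
    request, so the whole context is processed again when the tool returns.
    """
    context = prompt_len
    total = prompt_len  # initial prefill of the prompt
    for generated, returned in segments:
        total += generated      # decoding: one forward position per new token
        context += generated
        if recompute:
            total += context    # re-process everything computed so far
        total += returned       # tool output is new and is always processed
        context += returned
    return total


# Illustrative sizes only: a 100-token prompt and two interceptions.
segments = [(20, 10), (20, 10)]
kept = forwarded_tokens(100, segments, recompute=False)
redone = forwarded_tokens(100, segments, recompute=True)
print(kept, redone, redone - kept)
```

In this toy setting the gap between the two totals is the work spent on context that had already been computed, and it grows with both the context length and the number of interceptions.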