Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language understanding, information retrieval and search, translation, chatbots, virtual assistants, and many more. However, it is well known that LLMs are massive in terms of the number of parameters. Additionally, the self-attention mechanism in the underlying architecture of LLMs, the Transformer, has quadratic complexity in both computation and memory with respect to the input sequence length. For these reasons, LLM inference is resource-intensive, and its throughput is limited, especially for longer sequences. In this report, we design a collaborative inference architecture between a server and its clients to alleviate this throughput limit. The design accounts for the available resources on both sides, i.e., the computation and communication costs. We develop a dynamic programming-based algorithm that optimally allocates computation between the server and the client device to increase server throughput without violating the service level agreement (SLA). Our experiments show that the workload is distributed efficiently, reducing the server workload by roughly one third and outperforming a greedy baseline by 19 percent. As a result, we demonstrate that, in an environment with different types of LLM inference requests, the throughput of the server is improved.
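The report's allocation algorithm is not specified in this abstract, but the general idea of choosing which work to offload to clients under an SLA constraint can be illustrated with a knapsack-style dynamic program. The sketch below is purely hypothetical: the request tuples, the single client compute budget, and the assumption that offloading is an all-or-nothing per-request choice are all illustrative simplifications, not the paper's actual formulation.

```python
# Hypothetical sketch: a 0/1-knapsack DP that decides which inference
# requests to offload from the server to a client, maximizing the server
# compute saved while respecting each request's SLA and a client budget.
# All structures and numbers are illustrative, not from the report.

def allocate(requests, client_budget):
    """requests: list of tuples
         (server_cost_if_kept, client_cost_if_offloaded,
          latency_if_offloaded, sla_latency).
    Returns the maximum server compute that can be offloaded."""
    # Only requests whose offloaded latency still meets the SLA qualify.
    items = [(s_cost, c_cost)
             for s_cost, c_cost, lat, sla in requests if lat <= sla]
    # dp[b] = max server work saved using at most b units of client compute.
    dp = [0] * (client_budget + 1)
    for saved, cost in items:
        # Iterate the budget downward so each request is offloaded at most once.
        for b in range(client_budget, cost - 1, -1):
            dp[b] = max(dp[b], dp[b - cost] + saved)
    return dp[client_budget]

requests = [
    (5, 2, 80, 100),   # offloadable: latency 80 <= SLA 100
    (7, 3, 120, 100),  # offloading would violate the SLA; must stay on server
    (4, 2, 90, 100),   # offloadable
]
print(allocate(requests, client_budget=4))  # -> 9 (offload the 1st and 3rd)
```

A real collaborative-inference scheduler would also need to model communication cost and per-layer split points rather than whole-request offloading, but the same DP structure (budgeted selection under feasibility constraints) carries over.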