In Large Language Model (LLM) inference services, it is challenging to configure a parallelism strategy that efficiently processes requests with widely varying context lengths. Long-context requests need a high degree of parallelism to provide more memory for the Key-Value (KV) cache, whereas short-context requests favor a low degree of parallelism, which increases concurrency and thus improves throughput. To maintain high throughput while supporting large context lengths on demand, we propose Amoeba, a runtime Tensor Parallel (TP) transformation for online LLM inference services that adaptively adjusts the TP degree of running instances to match the dynamics of incoming requests. Evaluations on real-world traces show that Amoeba improves throughput by 1.75x to 6.57x compared with state-of-the-art solutions.
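The trade-off between KV cache capacity and concurrency can be made concrete with a rough back-of-the-envelope calculation. The sketch below is illustrative only and is not Amoeba's algorithm: the model shape, weight size, GPU memory, candidate TP degrees, and every function name are assumptions, and activation memory and fragmentation are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelSpec:
    # Hypothetical model shape, chosen only for illustration.
    num_layers: int = 32
    num_kv_heads: int = 32
    head_dim: int = 128
    bytes_per_value: int = 2       # assumes fp16 KV cache entries
    weight_bytes: int = 14 * GIB   # assumed total size of the weights


def kv_bytes_per_token(m: ModelSpec) -> int:
    # One key and one value vector (factor 2) per layer and per KV head.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.bytes_per_value


def max_context_tokens(m: ModelSpec, tp: int, gpu_bytes: int) -> int:
    # With TP, weights and KV cache are sharded across `tp` GPUs, so the
    # memory left for KV cache is the pooled memory minus the weights.
    free = tp * gpu_bytes - m.weight_bytes
    return max(free, 0) // kv_bytes_per_token(m)


def min_tp_for_context(
    m: ModelSpec,
    context_tokens: int,
    gpu_bytes: int,
    candidates: Sequence[int] = (1, 2, 4, 8),
) -> Optional[int]:
    # Smallest candidate TP degree whose pooled KV cache fits the context.
    for tp in candidates:
        if max_context_tokens(m, tp, gpu_bytes) >= context_tokens:
            return tp
    return None


def num_instances(total_gpus: int, tp: int) -> int:
    # A lower TP degree leaves room for more independent instances.
    return total_gpus // tp


if __name__ == "__main__":
    model = ModelSpec()
    gpu = 24 * GIB  # assumed per-GPU memory
    for ctx in (4_096, 100_000):
        tp = min_tp_for_context(model, ctx, gpu)
        print(ctx, "tokens -> TP", tp, "->", num_instances(8, tp), "instances on 8 GPUs")
```

Under these assumed numbers, a short request fits on a single GPU, so eight GPUs can host eight independent instances, while a 100,000-token request needs the pooled memory of four GPUs and leaves room for only two instances. A static configuration has to pick one point on this curve, which is the tension a runtime TP transformation is meant to relieve.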