Large Language Model (LLM) inference on large-scale systems is expected to dominate future cloud infrastructures. Achieving efficient LLM inference in cloud environments with numerous AI accelerators is challenging and requires extensive optimization to reach peak performance. Some current systems batch the prefill and decoding phases together to boost throughput but suffer from latency issues, while others disaggregate the two phases, leading to resource underutilization. We propose AcceLLM, a novel method that addresses latency and load balancing, inspired by cache data management: it strategically exploits redundant data to improve inference through load balancing and optimal hardware utilization. Simulation-based evaluations on Nvidia H100 GPUs and Huawei Ascend 910B2 accelerators show that AcceLLM surpasses state-of-the-art systems by up to 30% in latency and efficiency while handling diverse workloads effectively.