Deployers of online LLM services usually seek to maximize cluster-wide performance given a fixed number of GPUs. Tensor parallelism (TP) is necessary to fit modern models, but its speedup is sub-linear in the TP degree t: cross-GPU communication and non-scalable runtime work limit the gain, as Amdahl's Law predicts. Conversely, increasing t improves memory efficiency and alleviates KV-cache contention and swapping. We identify and validate an empirical optimal TP degree t_e that balances these two effects. We present Albireo, a parallel inference system that raises the attainable t_e by shrinking the non-scalable portion in two ways, overlapping scheduling and I/O with compute and applying sequence-parallel sampling, without changing model architectures. Across models and benchmarks, Albireo achieves up to 1.9x higher throughput, 48% lower latency, 28% higher GPU utilization, and 54% lower energy consumption than vLLM; in production it yields up to 2x higher throughput.
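The Amdahl's Law argument above can be illustrated with a minimal sketch. It models only the scaling side of the trade-off: the function names and the serial fractions (0.20 and 0.05) are illustrative assumptions, not measurements from Albireo or vLLM, and the memory and KV-cache effects that determine t_e are deliberately left out.

```python
def amdahl_speedup(t: int, serial_fraction: float) -> float:
    """Speedup over t = 1 when `serial_fraction` of the work does not scale with t."""
    if t < 1:
        raise ValueError("t must be >= 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / t)


def parallel_efficiency(t: int, serial_fraction: float) -> float:
    """Fraction of ideal linear speedup that is actually achieved at TP degree t."""
    return amdahl_speedup(t, serial_fraction) / t


if __name__ == "__main__":
    # Hypothetical non-scalable fractions: a larger one, then a smaller one
    # standing in for a runtime with less non-scalable work.
    for s in (0.20, 0.05):
        for t in (1, 2, 4, 8):
            print(
                f"s={s:.2f} t={t} "
                f"speedup={amdahl_speedup(t, s):.2f} "
                f"efficiency={parallel_efficiency(t, s):.2f}"
            )
```

Under these assumptions, a smaller non-scalable fraction keeps per-GPU efficiency higher at every t, which is the sense in which shrinking that fraction makes larger TP degrees worthwhile.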