Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency; however, its GPU communication overhead reduces combined token throughput. Data parallelism (DP), on the other hand, achieves higher throughput but suffers from high response latency. Neither offers the best of both worlds, and TP and DP cannot simply be combined, because the KV cache layout differs across the two parallelisms. We observe that sequence parallelism (SP, known as Ulysses in training) has properties similar to DP while keeping the KV cache layout invariant with TP. We adapt SP to inference and combine it with TP to get the best of both worlds. Our solution, Shift Parallelism, dynamically switches between TP and SP, minimizing latency under low traffic without sacrificing throughput under high traffic. Thanks to its efficient GPU communication, Shift Parallelism yields (i) up to 1.51x faster responses in interactive workloads and (ii) 50% higher throughput in batch workloads, compared to a TP-only solution. We evaluate Shift Parallelism on real-world production traces with dynamic traffic patterns, as well as synthetic benchmarks across models, context sizes, and arrival rates. All results affirm the same conclusion: Shift Parallelism achieves a better latency vs. throughput tradeoff than TP or DP, and hence obtains low latency without degrading throughput in dynamic workloads.
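The switching behavior described above can be illustrated with a minimal sketch. The function name, threshold, and decision rule here are illustrative assumptions for exposition, not the paper's actual scheduling policy:

```python
# Hypothetical sketch of a Shift Parallelism switching policy.
# The threshold and decision rule are illustrative assumptions;
# the paper's actual scheduler is not specified here.

def choose_parallelism(pending_requests: int, batch_threshold: int = 8) -> str:
    """Pick a parallelism mode for the current scheduling step.

    Low traffic  -> TP: minimizes per-token response latency.
    High traffic -> SP: sustains combined token throughput.
    Switching on the fly is possible because Ulysses-style SP keeps
    the KV cache layout invariant with TP, unlike DP.
    """
    return "TP" if pending_requests < batch_threshold else "SP"
```

The key enabler is the KV cache invariance: because SP and TP share the same cache layout, the mode can change between scheduling steps without migrating or recomputing cached keys and values.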