Efficient large-scale inference of transformer-based large language models (LLMs) remains a fundamental systems challenge, often requiring multi-GPU parallelism to meet stringent latency and throughput targets. Conventional tensor parallelism decomposes matrix operations across devices but introduces substantial inter-GPU synchronization, creating communication bottlenecks that degrade scalability. We propose the Parallel Track (PT) Transformer, a novel architectural paradigm that restructures computation to minimize cross-device dependencies. In our experiments, PT reduces synchronization operations by up to 16x relative to standard tensor parallelism while maintaining competitive model quality. We integrate PT into two widely adopted LLM serving stacks, TensorRT-LLM and vLLM, and report consistent improvements in serving efficiency in both settings: 15-30% reductions in time to first token, 2-12% reductions in time per output token, and throughput gains of up to 31.90%.
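To make the baseline synchronization cost concrete, the following is a minimal single-host sketch of the standard tensor-parallel FFN pattern (the Megatron-style column-then-row sharding that the abstract refers to as "conventional tensor parallelism"). PT's internals are not reproduced here; the sketch only illustrates the per-block all-reduce whose frequency PT is designed to reduce. The function name, shapes, and device count are illustrative assumptions, and a plain NumPy sum stands in for the inter-GPU all-reduce.

```python
import numpy as np

def tensor_parallel_ffn(x, w1, w2, num_devices):
    """Simulate a tensor-parallel transformer FFN block on one host.

    w1 is split column-wise, so each shard's matmul (and the elementwise
    activation) is independent and needs no communication. w2 is split
    row-wise, so each shard holds only a partial sum of the output and an
    all-reduce is required to combine them. Returns the output and the
    number of synchronization points incurred.
    """
    sync_ops = 0
    # Column-parallel first matmul: each shard computes independent columns.
    w1_shards = np.split(w1, num_devices, axis=1)
    h_shards = [np.maximum(x @ w, 0.0) for w in w1_shards]  # per-shard ReLU
    # Row-parallel second matmul: shards produce partial sums of the output.
    w2_shards = np.split(w2, num_devices, axis=0)
    partials = [h @ w for h, w in zip(h_shards, w2_shards)]
    out = sum(partials)  # stands in for the inter-GPU all-reduce
    sync_ops += 1        # one all-reduce per FFN block
    return out, sync_ops

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 8))

ref = np.maximum(x @ w1, 0.0) @ w2  # unsharded reference computation
out, syncs = tensor_parallel_ffn(x, w1, w2, num_devices=4)
assert np.allclose(out, ref)
```

In a standard transformer layer this all-reduce occurs once per FFN block and once per attention block, so synchronization cost grows with depth; the reduction PT claims comes from restructuring the computation so that most blocks no longer require this cross-device combine step.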