With the increasing scale of Large Language Models (LLMs) and their expanding context lengths, attention computation has become a key performance bottleneck in LLM serving. For fast attention computation, recent practices often parallelize attention heads across multiple GPUs, and also widely adopt attention sparsification, which selectively computes a subset of attention pairs under a preset sparsity budget, to reduce computation. In this paper, we observe that the attention heads of an LLM often exhibit heterogeneous-yet-stable sparsity elasticities, which motivates us to enforce head-adaptive sparsity budgets to attain better efficiency while preserving high inference quality. Yet, from a system perspective, heterogeneous sparsity levels make attention computation times inconsistent across heads, yielding cross-GPU resource bubbles under head-parallel deployment. To minimize such bubbles, we propose a novel attention deployment strategy called Sparsity-aware Head-Parallel Load Balance (S-HPLB). Experiments on long-context benchmarks show that S-HPLB achieves a $2.88\times$ improvement in average attention computation latency without quality degradation.
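To make the load-balancing intuition concrete, below is a minimal sketch of sparsity-aware head-to-GPU assignment. It assumes each head's attention cost is proportional to its sparsity budget; the abstract does not specify S-HPLB's actual algorithm, so a standard greedy longest-processing-time (LPT) bin-packing heuristic stands in here, and all names (`assign_heads`, the example budgets) are illustrative.

```python
# Sketch: balance heterogeneous per-head sparsity budgets across GPUs.
# Assumption: a head's compute time scales with its sparsity budget.
# Greedy LPT stand-in; NOT the paper's actual S-HPLB algorithm.
import heapq

def assign_heads(budgets, num_gpus):
    """Assign head indices to GPUs, balancing total budget per GPU."""
    # Min-heap of (current_load, gpu_id): the costliest unplaced head
    # always goes to the currently least-loaded GPU.
    heap = [(0.0, g) for g in range(num_gpus)]
    heapq.heapify(heap)
    placement = {g: [] for g in range(num_gpus)}
    for head, cost in sorted(enumerate(budgets), key=lambda x: -x[1]):
        load, gpu = heapq.heappop(heap)
        placement[gpu].append(head)
        heapq.heappush(heap, (load + cost, gpu))
    return placement

# Example: 8 heads with heterogeneous budgets, 2 GPUs.
budgets = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
placement = assign_heads(budgets, 2)
loads = [sum(budgets[h] for h in placement[g]) for g in range(2)]
print(placement, loads)
```

With a uniform (non-adaptive) budget this balancing is trivial; it is precisely the head-adaptive budgets that create the imbalance the deployment strategy must absorb.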