Attention efficiency is critical to large language model (LLM) inference. While prior advances optimize attention execution for individual requests (e.g., FlashAttention), production LLM serving relies on batching requests with highly heterogeneous sequence lengths to achieve high serving throughput. This mismatch induces severe computation and I/O imbalance, exacerbates stragglers, and underutilizes GPU resources. We present PackInfer, a kernel-level attention framework that enables compute- and I/O-aware execution for heterogeneous batched inference. PackInfer orchestrates batched requests into load-balanced execution groups, saturating GPU utilization by packing multiple requests into unified kernel launches. By constructing attention kernels directly over packed query-key regions, PackInfer eliminates redundant computation and balances thread-block execution. It further incorporates I/O-aware grouping that co-locates shared-prefix requests and reorganizes KV caches into group-contiguous layouts, reducing memory fragmentation and redundant data movement as generation evolves. Evaluations on real-world workloads show that PackInfer reduces inference latency by 13.0–20.1% and improves throughput by 20% compared to the state-of-the-art FlashAttention.
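To make the load-balanced grouping idea concrete, the following is a minimal sketch of one standard way to pack requests with heterogeneous sequence lengths into a fixed number of execution groups: greedy longest-processing-time (LPT) assignment. The function name `pack_requests` and the use of LPT are illustrative assumptions for exposition; the abstract does not specify PackInfer's actual grouping algorithm.

```python
# Hypothetical sketch: greedy LPT packing of requests into execution groups.
# This is NOT PackInfer's actual algorithm, only a common load-balancing baseline.

def pack_requests(seq_lens, num_groups):
    """Assign each request (by sequence length) to the currently
    least-loaded group, processing longer requests first (LPT).

    Returns (groups, loads): groups[i] is a list of (request_index, seq_len)
    tuples, loads[i] is the total sequence length assigned to group i.
    """
    groups = [[] for _ in range(num_groups)]
    loads = [0] * num_groups
    # Longest-first ordering keeps the final loads close to balanced.
    for idx, n in sorted(enumerate(seq_lens), key=lambda p: -p[1]):
        g = loads.index(min(loads))  # least-loaded group so far
        groups[g].append((idx, n))
        loads[g] += n
    return groups, loads


if __name__ == "__main__":
    # Heterogeneous batch: one long request plus several short ones.
    groups, loads = pack_requests([512, 256, 256, 128, 128, 64], num_groups=2)
    print(sorted(loads))  # group totals differ by only 64 tokens
```

Under this greedy scheme, the per-group totals stay within one request's length of each other, which is the kind of thread-block balance the abstract attributes to PackInfer's execution groups.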