LLM serving is increasingly dominated by decode attention, a memory-bound operation due to the massive KV cache it loads from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across requests (e.g., system prompts, tools/templates, RAG). Existing attention implementations fail to fully exploit this prefix sharing: one-query-per-CTA execution repeatedly loads the shared prefix KV cache, while one-size-fits-all tiling leaves on-chip resources idle and exacerbates bubbles when KV lengths are uneven. Both choices amplify memory bandwidth pressure and stall the already memory-bound decode attention. This paper introduces PAT, a prefix-aware attention kernel implementation for LLM decoding that organizes execution with a pack-forward-merge paradigm. PAT packs queries by shared prefix to reduce repeated memory accesses and runs a customized multi-tile kernel to achieve high resource efficiency. It further applies practical multi-stream forwarding and KV splitting to reduce resource bubbles. The final merge combines partial results via online softmax with negligible overhead. We implement PAT as an off-the-shelf plugin for vLLM. Evaluation on both real-world and synthetic workloads shows that PAT reduces attention latency by 53.5% on average and time per output token (TPOT) by 17.0-93.1% compared with state-of-the-art attention kernels under the same configurations. PAT's source code is publicly available at https://github.com/flashserve/PAT.
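To make the merge step concrete, the following is a minimal NumPy sketch (not PAT's CUDA kernel) of how two partial attention results, e.g., one computed over the shared prefix and one over a request's private suffix, can be combined with the standard online-softmax (log-sum-exp) rescaling at negligible per-query cost; the function names partial_attend and merge_partials are illustrative, not part of PAT's API.

```python
# Minimal sketch of online-softmax merging of partial attention results.
# Assumed/illustrative names: partial_attend, merge_partials (not from PAT).
import numpy as np

def partial_attend(q, k, v):
    """Attend q (d,) over one KV segment k, v of shape (n, d).
    Returns the normalized partial output, the running max, and the softmax sum."""
    scores = k @ q / np.sqrt(q.shape[-1])   # (n,) attention logits
    m = scores.max()                         # running max for numerical stability
    p = np.exp(scores - m)                   # unnormalized probabilities
    s = p.sum()                              # softmax denominator for this segment
    o = (p @ v) / s                          # normalized partial output, shape (d,)
    return o, m, s

def merge_partials(o1, m1, s1, o2, m2, s2):
    """Merge two normalized partial outputs via log-sum-exp rescaling."""
    m = max(m1, m2)
    w1 = s1 * np.exp(m1 - m)                 # rescaled weight of segment 1
    w2 = s2 * np.exp(m2 - m)                 # rescaled weight of segment 2
    return (w1 * o1 + w2 * o2) / (w1 + w2)

# Toy check: merging prefix/suffix partials matches attention over the full KV.
rng = np.random.default_rng(0)
d, n_prefix, n_suffix = 8, 16, 4
q = rng.standard_normal(d)
k = rng.standard_normal((n_prefix + n_suffix, d))
v = rng.standard_normal((n_prefix + n_suffix, d))

o_full, _, _ = partial_attend(q, k, v)
o_pre = partial_attend(q, k[:n_prefix], v[:n_prefix])
o_suf = partial_attend(q, k[n_prefix:], v[n_prefix:])
assert np.allclose(o_full, merge_partials(*o_pre, *o_suf))
```

Because the merge only rescales and adds two d-dimensional vectors per query, its cost is negligible compared with loading the KV cache, which is why splitting work by prefix and by KV range does not change the attention result.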