Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators such as Google's Tensor Processing Units (TPUs), as operators prioritize both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures, particularly under the dynamic and ragged execution patterns common in modern serving. In this paper, we present Ragged Paged Attention (RPA), a high-performance and flexible attention kernel for TPUs, implemented using Pallas and Mosaic. RPA addresses these challenges through three key techniques: (1) fine-grained tiling to enable efficient dynamic slicing over ragged memory, (2) a custom software pipeline that fuses KV cache updates with attention computation, and (3) a distribution-aware compilation strategy that generates specialized kernels for decode, prefill, and mixed workloads. Evaluated with Llama 3 8B on TPU7x, RPA achieves up to 86% memory bandwidth utilization (MBU) in decode and up to 73% model FLOPs utilization (MFU) in prefill. Integrated as the primary TPU backend in vLLM and SGLang, RPA provides a production-grade foundation for efficient TPU inference and offers practical insights into kernel design.
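To make the "ragged" and "paged" terms concrete, the following is a minimal pure-Python reference of what such an attention kernel computes. It is a sketch under stated assumptions, not the RPA kernel itself: it uses a single head, ignores tiling, pipelining, and KV cache updates, and the function name, argument names, and data layout are all hypothetical, chosen only for illustration.

```python
import math

def ragged_paged_attention(queries, kv_pages, page_table, kv_lens, q_lens, page_size):
    """Reference (non-optimized) causal attention over a paged KV cache.

    Hypothetical layout, assumed for illustration only:
    queries:    query vectors of all sequences, concatenated (ragged batch).
    kv_pages:   list of pages; each page holds page_size (key, value) slots.
    page_table: page_table[s] lists the page indices owned by sequence s.
    kv_lens:    kv_lens[s] is the number of valid KV tokens of sequence s.
    q_lens:     q_lens[s] is the number of query tokens of sequence s
                (1 for decode, more than 1 for prefill).
    """
    outputs = []
    q_start = 0
    for s, (q_len, kv_len) in enumerate(zip(q_lens, kv_lens)):
        # Gather this sequence's keys and values through its page table.
        keys, values = [], []
        for pos in range(kv_len):
            page = kv_pages[page_table[s][pos // page_size]]
            k, v = page[pos % page_size]
            keys.append(k)
            values.append(v)
        for i in range(q_len):
            q = queries[q_start + i]
            # Causal mask: the i-th new query sits at absolute position
            # kv_len - q_len + i and attends to positions up to itself.
            visible = kv_len - q_len + i + 1
            scale = 1.0 / math.sqrt(len(q))
            scores = [scale * sum(a * b for a, b in zip(q, keys[j]))
                      for j in range(visible)]
            m = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(x - m) for x in scores]
            total = sum(weights)
            out = [sum(w * values[j][d] for j, w in enumerate(weights)) / total
                   for d in range(len(values[0]))]
            outputs.append(out)
        q_start += q_len
    return outputs

# Mixed batch: sequence 0 is a decode step (1 query, 3 cached tokens),
# sequence 1 is a prefill (2 queries, 2 tokens). Pages are not contiguous.
pad = ([0.0, 0.0], [0.0, 0.0])
kv_pages = [
    [([1.0, 1.0], [2.0, 2.0]), pad],                       # page 0: seq 0, token 2
    [pad, pad],                                            # page 1: unused
    [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])],  # page 2: seq 0, tokens 0-1
    [([1.0, 0.0], [5.0, 6.0]), ([0.0, 1.0], [7.0, 8.0])],  # page 3: seq 1, tokens 0-1
]
page_table = [[2, 0], [3]]
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
outputs = ragged_paged_attention(queries, kv_pages, page_table,
                                 kv_lens=[3, 2], q_lens=[1, 2], page_size=2)
```

The batch deliberately mixes a decode request and a prefill request with different query and KV lengths, which is the kind of ragged workload the abstract refers to. A production kernel would avoid materializing the gathered keys and values and would instead stream pages through on-chip memory.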