TokenStack: A Heterogeneous HBM-PIM Architecture and Runtime for Efficient LLM Inference

Large language model (LLM) serving is now limited by the key-value (KV) cache. During decode, each new token rereads prior KV state, so attention becomes a bandwidth- and capacity-heavy memory task. HBM-PIM helps by moving attention closer to memory, but current stack organizations still waste resources. In practice, only hot KV blocks benefit from near-memory compute. Weights, activations, and cold KV mainly need dense storage and GPU-visible bandwidth. A uniform HBM-PIM stack makes all layers pay for PIM logic, while a dedicated-PIM design such as AttAcc recovers capacity but shrinks the HBM bandwidth left for GPU-side work. We propose TokenStack, a vertically heterogeneous HBM-PIM architecture for KV-centric LLM serving that leverages HBM4's logic-die substrate. TokenStack separates each stack into dense capacity layers and PIM-enabled compute layers, then uses the logic base die as a stack-local control point that manages cross-layer movement without host-side overhead. The base-die controller handles cross-layer DMA, layered address translation, attention-side gather/broadcast coordination, and inline quantization during migration. On top of this hardware, TokenStack uses topology-aware KV placement, workload-aware eviction, and bounded replication to keep hot KV near PIM compute while moving colder state to dense layers. Using production-derived traces across four models, completed multi-QPS runs show that TokenStack increases geometric-mean token throughput by 1.62x and SLO-compliant serving capacity by 1.70x over AttAcc, and reduces per-token energy by 30-47%.
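
The claim that decode-time attention is bandwidth- and capacity-heavy can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the TokenStack evaluation. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte elements) and the 4096-token context are hypothetical values chosen only for illustration, and the function names are our own. It assumes standard attention, where each cached token stores one key and one value vector per layer and per KV head, and where one decode step reads the full cached context for a sequence.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache stored for a single token of a single sequence."""
    # Factor 2: one key vector and one value vector per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_bytes_read_per_decode_step(context_len, **shape):
    """Bytes of KV state reread to generate one new token (no sparsity assumed)."""
    return context_len * kv_bytes_per_token(**shape)


# Hypothetical model shape, for illustration only (not one of the paper's models).
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_elem=2)

per_token = kv_bytes_per_token(**shape)
per_step = kv_bytes_read_per_decode_step(4096, **shape)

print(per_token / 2**20, "MiB of KV per cached token")
print(per_step / 2**30, "GiB of KV read per decode step per sequence")
```

Under these assumed numbers, every generated token rereads gigabytes of KV state per sequence while performing little arithmetic on each byte, which is the access pattern that near-memory attention targets.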
