Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators

Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt, Raghunandan Keshavan, Shao-Chuan Wang, Xinyang Yi, Mingyan Gao, Onkar Dalal, Lichan Hong, Ed Chi, Ningren Han

From arXiv, 14 pages, 4 figures

Generative retrieval has emerged as a powerful paradigm for LLM-based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g. enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that make use of prefix trees (Tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high-throughput LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking massive efficiency gains on hardware accelerators. We deploy STATIC on a large-scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step and 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47-1033x speedup over a hardware-accelerated binary-search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production-scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval. Our code is available at https://github.com/youtube/static-constraint-decoding.
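To make the core idea concrete, the following is a minimal NumPy sketch of flattening a prefix tree into CSR-style arrays and computing allowed-token masks for a whole batch without pointer chasing. It is an illustration under stated assumptions, not the STATIC implementation: the function names (`build_csr_trie`, `allowed_mask`, `advance`), the array layout, and the padded `max_branch` gather are all hypothetical choices made for this example.

```python
import numpy as np

def build_csr_trie(sequences):
    """Flatten a prefix tree over token sequences into CSR-style arrays (hypothetical layout)."""
    children = [{}]  # children[node] maps token -> child node id; node 0 is the root
    for seq in sequences:
        node = 0
        for tok in seq:
            if tok not in children[node]:
                children[node][tok] = len(children)
                children.append({})
            node = children[node][tok]
    indptr = np.zeros(len(children) + 1, dtype=np.int64)
    tokens, targets = [], []
    for node, kids in enumerate(children):
        for tok in sorted(kids):
            tokens.append(tok)         # column index: the token on this edge
            targets.append(kids[tok])  # value: the trie state reached via this edge
        indptr[node + 1] = len(tokens)
    return indptr, np.array(tokens, dtype=np.int64), np.array(targets, dtype=np.int64)

def _gather_edges(indptr, tokens, states, max_branch):
    """Gather a fixed-width window of outgoing edges for every state in the batch."""
    start = indptr[states][:, None]
    end = indptr[states + 1][:, None]
    offsets = start + np.arange(max_branch)[None, :]
    valid = offsets < end                         # padding slots are marked invalid
    safe = np.minimum(offsets, len(tokens) - 1)   # clamp so padded reads stay in bounds
    return valid, safe

def allowed_mask(indptr, tokens, states, vocab_size, max_branch):
    """Boolean [batch, vocab_size] mask of tokens permitted from each trie state."""
    valid, safe = _gather_edges(indptr, tokens, states, max_branch)
    cols = tokens[safe]
    rows = np.broadcast_to(np.arange(len(states))[:, None], cols.shape)
    mask = np.zeros((len(states), vocab_size), dtype=bool)
    mask[rows[valid], cols[valid]] = True
    return mask

def advance(indptr, tokens, targets, states, chosen, max_branch):
    """Next trie state per batch row, assuming each chosen token is allowed."""
    valid, safe = _gather_edges(indptr, tokens, states, max_branch)
    hit = valid & (tokens[safe] == chosen[:, None])
    slot = hit.argmax(axis=1)
    return targets[safe[np.arange(len(states)), slot]]

# Toy item catalogue: each item is a length-3 token sequence over a vocabulary of 8.
items = [[4, 1, 3], [4, 1, 5], [4, 2, 0], [6, 0, 1]]
indptr, tokens, targets = build_csr_trie(items)
states = np.array([0, 1])                          # one trie state per beam
mask = allowed_mask(indptr, tokens, states, vocab_size=8, max_branch=2)
next_states = advance(indptr, tokens, targets, states, np.array([6, 2]), max_branch=2)
```

In a decoding loop, `mask` would be applied to the logits (for example by setting disallowed entries to negative infinity) before sampling or beam selection, and `advance` would update the per-beam state. The boolean-indexed scatter in `allowed_mask` is convenient in NumPy but produces data-dependent shapes; an accelerator version would likely need a fixed-shape formulation instead.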
