Real-time recommender systems execute multi-stage cascades (retrieval, pre-processing, fine-grained ranking) under strict tail-latency SLOs, leaving only tens of milliseconds for ranking. Generative recommendation (GR) models can improve quality by consuming long user-behavior sequences, but in production their online sequence length is tightly capped by the ranking-stage P99 budget. We observe that the majority of GR tokens encode user behaviors that are independent of the item candidates, suggesting an opportunity to pre-infer a user-behavior prefix once and reuse it during ranking rather than recomputing it on the critical path. Realizing this idea at industrial scale is non-trivial: the prefix cache must survive across multiple pipeline stages before the final ranking instance is determined; the user population implies cache footprints far beyond a single device's HBM capacity; and indiscriminate pre-inference would overload shared resources under high QPS. We present RelayGR, a production system that enables in-HBM relay-race inference for GR. RelayGR selectively pre-infers long-term user prefixes, keeps their KV caches resident in HBM over the request lifecycle, and ensures that the subsequent ranking stage can consume them without remote fetches. RelayGR combines three techniques: 1) a sequence-aware trigger that admits only at-risk requests under a bounded cache footprint and pre-inference load, 2) an affinity-aware router that co-locates cache production and consumption by routing both the auxiliary pre-infer signal and the ranking request to the same instance, and 3) a memory-aware expander that uses server-local DRAM to capture short-term cross-request reuse while avoiding redundant reloads. We implement RelayGR on Huawei Ascend NPUs and evaluate it with real queries. Under a fixed P99 SLO, RelayGR supports up to 1.5$\times$ longer sequences and improves SLO-compliant throughput by up to 3.6$\times$.
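To make the trigger and router concrete, the following Python sketch illustrates the two admission-and-placement decisions the abstract describes. It is a minimal illustration only: the class names, thresholds, and hash-based routing policy are our assumptions, not RelayGR's actual implementation.

```python
import hashlib

class SequenceAwareTrigger:
    """Admit a request for prefix pre-inference only if (a) its user-behavior
    sequence is long enough to put the ranking-stage P99 budget at risk and
    (b) the HBM cache footprint and pre-inference load stay bounded.
    All thresholds are illustrative placeholders, not RelayGR's values."""

    def __init__(self, min_prefix_tokens: int, max_cached_prefixes: int, max_inflight: int):
        self.min_prefix_tokens = min_prefix_tokens
        self.max_cached_prefixes = max_cached_prefixes
        self.max_inflight = max_inflight
        self.cached_prefixes = 0   # KV caches currently resident in HBM
        self.inflight = 0          # pre-inference jobs currently running

    def admit(self, prefix_tokens: int) -> bool:
        if prefix_tokens < self.min_prefix_tokens:
            return False  # short sequences meet the SLO without pre-inference
        if self.cached_prefixes >= self.max_cached_prefixes:
            return False  # bounded HBM cache footprint
        if self.inflight >= self.max_inflight:
            return False  # bounded pre-inference load under high QPS
        self.inflight += 1
        return True

class AffinityAwareRouter:
    """Route both the auxiliary pre-infer signal and the later ranking request
    for the same user to the same instance, so the KV cache is consumed where
    it was produced and no remote fetch is needed. A plain hash stands in for
    the production routing policy."""

    def __init__(self, instances: list[str]):
        self.instances = instances

    def route(self, user_id: str) -> str:
        h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        return self.instances[h % len(self.instances)]

# Usage: admit an at-risk request, then pin producer and consumer together.
trigger = SequenceAwareTrigger(min_prefix_tokens=2048,
                               max_cached_prefixes=10_000,
                               max_inflight=64)
router = AffinityAwareRouter([f"npu-{i}" for i in range(8)])
if trigger.admit(prefix_tokens=4096):
    target = router.route("user-42")
    # send the pre-infer signal now, and later the ranking request, to `target`
```

Because the router is a pure function of the user ID, any upstream pipeline stage can recompute the same placement, which is what lets the prefix cache survive until the final ranking instance is determined.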