Large Language Models (LLMs) are transforming recommendation from ranking into a generative task, but industrial deployment remains limited by the high latency of processing long, personalized prompts. Standard prefix caching provides limited benefit because reuse in recommendation workloads is often non-contiguous across user histories and item contexts. We present RcLLM, a distributed inference system for generative recommendation with Beyond-Prefix KV Caching. RcLLM decomposes prompts into reusable blocks and supports large item catalogs with a stratified distributed storage design: compact user-history caches are replicated for zero-latency retrieval, while massive item caches are sharded using similarity-aware placement. To reduce redundant quadratic attention computation, RcLLM combines an affinity-based global scheduler that improves data locality with a selective attention mechanism that corrects approximation errors. Experiments on real-world datasets show that RcLLM reduces Time-To-First-Token (TTFT) by 1.31x-9.51x compared with state-of-the-art prefix caching systems, enabling real-time serving with negligible impact on recommendation accuracy.
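To make the contrast between prefix caching and block-level reuse concrete, the following is a minimal sketch, not RcLLM's implementation. It assumes a prompt is already split into blocks (system text, user history, item descriptions) and compares two hypothetical keying schemes: a prefix key that depends on every preceding block, and a content-only key that depends on the block alone. All names (`block_key`, `prefix_keys`, `reused_blocks_prefix`, `reused_blocks_content`) and the toy token IDs are assumptions introduced for illustration.

```python
import hashlib
from typing import List, Tuple

Block = Tuple[int, ...]  # a block is a tuple of token IDs (toy values below)


def block_key(block: Block) -> str:
    """Content-only key: identical blocks match wherever they appear."""
    return hashlib.sha256(repr(block).encode()).hexdigest()


def prefix_keys(blocks: List[Block]) -> List[str]:
    """Prefix-style keys: each key covers the block and everything before it."""
    running = hashlib.sha256()
    keys = []
    for block in blocks:
        running.update(repr(block).encode())
        keys.append(running.hexdigest())
    return keys


def reused_blocks_prefix(cached: List[Block], new: List[Block]) -> int:
    """Count reusable blocks under prefix caching: stops at the first miss."""
    cached_keys = set(prefix_keys(cached))
    hits = 0
    for key in prefix_keys(new):
        if key not in cached_keys:
            break
        hits += 1
    return hits


def reused_blocks_content(cached: List[Block], new: List[Block]) -> int:
    """Count reusable blocks when keys ignore position and preceding context."""
    cached_keys = {block_key(b) for b in cached}
    return sum(block_key(b) in cached_keys for b in new)


# Toy prompts: same system text and candidate items, different user history.
SYSTEM, USER_A, USER_B = (1, 2, 3), (10, 11), (20, 21)
ITEM_X, ITEM_Y = (30, 31, 32), (40, 41)

cached_prompt = [SYSTEM, USER_A, ITEM_X, ITEM_Y]
new_prompt = [SYSTEM, USER_B, ITEM_X, ITEM_Y]

prefix_hits = reused_blocks_prefix(cached_prompt, new_prompt)
content_hits = reused_blocks_content(cached_prompt, new_prompt)
print(f"prefix caching reuses {prefix_hits} of {len(new_prompt)} blocks")
print(f"content-keyed caching reuses {content_hits} of {len(new_prompt)} blocks")
```

In this toy case the differing user history breaks the shared prefix after the first block, so the item blocks that follow are reusable only under content-only keys. The sketch counts cache hits and nothing more: KV entries reused outside their original context omit cross-block attention, which is the approximation error the abstract says the selective attention mechanism corrects, and that step is not modeled here.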