Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating more memory to one improves its efficiency but degrades the other. Existing systems optimize the two in isolation, overlooking that the optimal EMB-KV allocation ratio can shift by up to 0.35 across workload regimes, which leaves a 20-30\% latency improvement unrealized. Closing this gap requires online reallocation, but naive approaches put host-to-device (H2D) refill traffic on the critical path and cause P99 SLO violations. To address this, we present HELM, which jointly manages HBM allocation and request routing at runtime through two key components: (1) Adaptive Memory Allocation, a three-layer PPO-based controller (a frozen base policy, an online residual adapter, and a burst-aware recovery controller) that achieves a decision latency of $32\,\mu\mathrm{s}$ while staying within 0.024-0.029 of the offline-optimal ratio; and (2) EMB-KV-Aware Scheduling, which routes requests by jointly considering KV residency, embedding locality, and node load to avoid routing inefficiencies under heterogeneous allocations. Evaluations on three production-scale datasets on a 32-node A100 cluster show that HELM reduces P99 latency by 24-38\% over the best static policy and achieves 93.5-99.6\% SLO satisfaction across Steady, Trend, and Burst workloads, significantly outperforming state-of-the-art baselines without sacrificing throughput.
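To make the routing idea concrete, the following is a minimal sketch of how a scheduler might combine KV residency, embedding locality, and node load into a single score. It is not HELM's actual scheduler: the `NodeState` fields, the linear scoring form, and the weights `w_kv`, `w_emb`, and `w_load` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """Hypothetical per-node view used by the router (illustrative only)."""
    name: str
    kv_resident_users: set = field(default_factory=set)   # users whose KV cache is resident
    hot_embedding_ids: set = field(default_factory=set)   # rows in the HBM embedding hot cache
    load: float = 0.0                                      # utilization in [0, 1]


def routing_score(node, user_id, embedding_ids, w_kv=1.0, w_emb=1.0, w_load=1.0):
    """Assumed linear score: reward KV and embedding hits, penalize load."""
    kv_hit = 1.0 if user_id in node.kv_resident_users else 0.0
    if embedding_ids:
        # Fraction of the request's embedding rows already cached on this node.
        emb_hit = len(node.hot_embedding_ids & embedding_ids) / len(embedding_ids)
    else:
        emb_hit = 0.0
    return w_kv * kv_hit + w_emb * emb_hit - w_load * node.load


def route(nodes, user_id, embedding_ids, **weights):
    """Pick the node with the highest score for this request."""
    return max(nodes, key=lambda n: routing_score(n, user_id, embedding_ids, **weights))


# Node "a" holds the user's KV cache but is busy and caches half the embeddings;
# node "b" has no KV state but is lightly loaded and caches all of them.
nodes = [
    NodeState("a", kv_resident_users={"u1"}, hot_embedding_ids={1, 2}, load=0.9),
    NodeState("b", kv_resident_users=set(), hot_embedding_ids={1, 2, 3, 4}, load=0.1),
]
chosen_default = route(nodes, "u1", {1, 2, 3, 4})
chosen_kv_heavy = route(nodes, "u1", {1, 2, 3, 4}, w_kv=2.0)
```

With equal weights the request goes to the lightly loaded node with better embedding locality; doubling the assumed KV weight sends it to the node that already holds the user's KV cache. This illustrates why a router that considers only one of the three signals can choose poorly when nodes have different EMB-KV allocations.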