Deep Learning Recommendation Models (DLRMs) underpin personalized services but face a critical freshness-accuracy tradeoff due to massive parameter-synchronization overheads. Production DLRMs deploy decoupled training and inference clusters, where synchronizing petabyte-scale embedding tables (EMTs) incurs multi-minute staleness that degrades recommendation quality and revenue. We observe that (1) inference nodes exhibit sustained CPU underutilization (peak ≤ 20%), and (2) EMT gradients possess intrinsic low-rank structure, enabling a compact update representation. We present LiveUpdate, a system that eliminates inter-cluster synchronization by colocating Low-Rank Adaptation (LoRA) trainers on inference nodes. LiveUpdate addresses two core challenges: (1) dynamic rank adaptation via singular-value monitoring to bound memory overhead (<2% of EMT size), and (2) NUMA-aware resource scheduling with hardware-enforced QoS to eliminate contention between updates and inference serving (P99 latency impact <20ms). Evaluations show that LiveUpdate halves update cost relative to delta-update baselines while achieving higher accuracy within 1-hour windows. By turning idle inference resources into freshness engines, LiveUpdate delivers online model updates and outperforms state-of-the-art delta-update methods by 0.04% to 0.24% in accuracy.
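The two mechanisms named in the abstract can be made concrete with a short illustration. The sketch below is ours, not LiveUpdate's implementation: it represents an EMT update as a LoRA-style low-rank product B @ A, and it picks the effective rank by monitoring singular values of the accumulated update. All sizes, thresholds, and names (`lookup`, `adapt_rank`) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (ours, not LiveUpdate's code) of two ideas from the abstract:
#  1) represent an embedding-table (EMT) update as a LoRA-style low-rank product B @ A,
#  2) adapt the rank by monitoring singular values of the accumulated update.
# All sizes, names, and the energy threshold below are illustrative assumptions.

rng = np.random.default_rng(0)
VOCAB, DIM, R_MAX = 10_000, 128, 8        # table rows, embedding dim, rank cap

emt = rng.standard_normal((VOCAB, DIM)).astype(np.float32)       # frozen base table
B = np.zeros((VOCAB, R_MAX), dtype=np.float32)                   # low-rank factors:
A = 0.01 * rng.standard_normal((R_MAX, DIM)).astype(np.float32)  # update = B @ A

def lookup(ids, rank):
    """Serve embeddings as base + low-rank correction (reads only `rank` columns)."""
    return emt[ids] + B[ids, :rank] @ A[:rank]

def adapt_rank(energy=0.95):
    """Smallest rank whose singular values capture `energy` of the update's mass.

    A production system would track singular values incrementally; a full SVD
    of B @ A is used here only to keep the sketch short.
    """
    s = np.linalg.svd(B @ A, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy)) + 1

# Simulate SGD-style training of the adapter with stand-in gradients.
for _ in range(100):
    ids = rng.integers(0, VOCAB, size=256)
    g = rng.standard_normal((256, DIM)).astype(np.float32)  # dLoss/d(output rows)
    B[ids] -= 0.01 * (g @ A.T)            # chain rule through output = B @ A
    A -= 0.01 * (B[ids].T @ g) / 256

r = adapt_rank()
frac = (B[:, :r].nbytes + A[:r].nbytes) / emt.nbytes
print(f"effective rank {r}; adapter holds {frac:.1%} of EMT memory")
print(lookup(np.array([1, 2, 3]), rank=r).shape)  # fresh embeddings for serving
```

Capping and adapting the rank is what lets the adapter's footprint stay a small, tunable fraction of the full table; the NUMA-aware scheduling and hardware-enforced QoS that isolate this work from serving traffic are orthogonal and not shown in a single-process sketch.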