Deep learning-based recommendation models (DLRMs) are widely used in several business-critical applications. Training such recommendation models efficiently is challenging because they contain billions of embedding-based parameters, leading to significant overheads from embedding access. By profiling existing systems for DLRM training, we observe that around 75\% of the iteration time is spent on embedding access and model synchronization. Our key insight in this paper is that embedding access has a specific structure that can be used to accelerate training. We observe that embedding accesses are heavily skewed, with around 1\% of embeddings accounting for more than 92\% of total accesses. Further, we observe that during offline training we can look ahead at future batches to determine exactly which embeddings will be needed at which future iteration. Based on these insights, we develop Bagpipe, a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with computation. We design an Oracle Cacher, a new component that uses a lookahead algorithm to generate optimal cache update decisions while providing strong consistency guarantees against staleness. We also design a logically replicated, physically partitioned cache and show that our design can reduce synchronization overheads in a distributed setting. Finally, we propose a disaggregated system architecture and show that our design can enable low-overhead fault tolerance. Our experiments using three datasets and four models show that Bagpipe provides a speedup of up to 5.6x compared to state-of-the-art baselines, while providing the same convergence and reproducibility guarantees as synchronous training.
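To make the lookahead idea concrete, the following is a minimal illustrative sketch and not Bagpipe's actual Oracle Cacher. It assumes that every batch is available ahead of time as a list of embedding IDs, and that an embedding can be evicted once it does not appear in any of the next `lookahead` batches. The function name `plan_cache_updates` and its parameters are hypothetical.

```python
from typing import List, Set, Tuple


def plan_cache_updates(
    batches: List[List[int]], lookahead: int
) -> List[Tuple[Set[int], Set[int]]]:
    """Return one (prefetch, evict) pair per iteration.

    Assumption for illustration only: an embedding stays cached as long as
    it appears in one of the next `lookahead` batches, and is evicted
    (written back) right after the iteration that uses it otherwise.
    """
    cached: Set[int] = set()
    plan: List[Tuple[Set[int], Set[int]]] = []
    for i, batch in enumerate(batches):
        needed = set(batch)
        # Embeddings this iteration needs that are not yet cached.
        prefetch = needed - cached
        cached |= needed
        # Embeddings that will be used within the lookahead window.
        window: Set[int] = set()
        for future in batches[i + 1 : i + 1 + lookahead]:
            window.update(future)
        # Anything cached but not needed in the window can be evicted.
        evict = cached - window
        cached -= evict
        plan.append((prefetch, evict))
    return plan


if __name__ == "__main__":
    example_batches = [[1, 2], [2, 3], [1, 4]]
    for step, (prefetch, evict) in enumerate(plan_cache_updates(example_batches, 2)):
        print(step, sorted(prefetch), sorted(evict))
```

Because the access sequence is known in advance, the plan can be computed before the trainers reach each iteration. In this toy model, a larger window keeps frequently reused embeddings cached for longer, at the cost of a larger cache.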