BagPipe: Accelerating Deep Recommendation Model Training

Deep learning-based recommendation models (DLRMs) are widely used in business-critical applications. Training such models efficiently is challenging because they contain billions of embedding-based parameters, leading to significant overheads from embedding access. By profiling existing systems for DLRM training, we observe that around 75% of the iteration time is spent on embedding access and model synchronization. Our key insight in this paper is that embedding access has a specific structure that can be used to accelerate training. We observe that embedding accesses are heavily skewed, with around 1% of embeddings accounting for more than 92% of total accesses. Further, we observe that during offline training we can look ahead at future batches to determine exactly which embeddings will be needed at which future iteration. Based on these insights, we develop Bagpipe, a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with computation. We design an Oracle Cacher, a new component that uses a lookahead algorithm to generate optimal cache update decisions while providing strong consistency guarantees against staleness. We also design a logically replicated, physically partitioned cache and show that this design can reduce synchronization overheads in a distributed setting. Finally, we propose a disaggregated system architecture and show that it enables low-overhead fault tolerance. Our experiments using three datasets and four models show that Bagpipe provides a speedup of up to 5.6x compared to state-of-the-art baselines, while providing the same convergence and reproducibility guarantees as synchronous training.
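The lookahead idea in the abstract can be illustrated with a toy sketch. This is not Bagpipe's Oracle Cacher; it is a minimal illustration that assumes each batch is simply a list of embedding IDs and that the cache should hold exactly the IDs needed within the next `lookahead` batches. The function name `plan_cache_updates` and its parameters are hypothetical.

```python
def plan_cache_updates(batches, lookahead):
    """Toy lookahead planner (hypothetical, not Bagpipe's algorithm).

    For each iteration i, returns a pair (prefetch, evict):
      prefetch: IDs needed in batches[i : i + lookahead] that are not cached yet
      evict:    cached IDs that no batch in that window uses
    """
    cached = set()
    plan = []
    for i in range(len(batches)):
        # Collect every embedding ID used within the lookahead window.
        window = set()
        for batch in batches[i:i + lookahead]:
            window.update(batch)
        prefetch = window - cached   # must be fetched before it is needed
        evict = cached - window      # not used again inside the window
        cached = window
        plan.append((sorted(prefetch), sorted(evict)))
    return plan


# Each inner list holds the embedding IDs accessed by one training batch.
batches = [[1, 2], [2, 3], [1, 4], [5]]
for step, (prefetch, evict) in enumerate(plan_cache_updates(batches, lookahead=2)):
    print(step, "prefetch:", prefetch, "evict:", evict)
```

Because the future batches are known ahead of time in offline training, a planner of this kind can issue fetches before the iteration that needs them, which is what lets remote embedding access overlap with computation. A real system would also have to write updated embeddings back before eviction and bound the cache size, which this sketch ignores.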

