We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactions. While each user session can generate multiple training samples, many features' values do not change across these samples. We demonstrate how RecD exploits this property, end-to-end, across a deployed training pipeline. RecD optimizes data generation pipelines to decrease dataset storage and preprocessing resource demands and to maximize duplication within a training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors (IKJTs), to deduplicate feature values in each batch. We show how DLRM model architectures can leverage IKJTs to drastically increase training throughput. RecD improves the training and preprocessing throughput and storage efficiency by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
翻译:我们提出RecD(推荐系统去重技术)——一套针对深度学习推荐模型(DLRM)训练管道的端到端基础设施优化方案。RecD致力于解决工业级DLRM训练数据集中因特征重复性固有特性引发的存储、预处理及训练方面的巨大开销。特征重复性源于DLRM数据集由交互行为生成:每个用户会话可产生多个训练样本,但诸多特征值在这些样本中保持不变。我们展示了RecD如何利用这一特性,在已部署的训练管道中实现端到端优化。RecD优化数据生成管道以降低数据集存储与预处理资源需求,并最大化训练批次内的重复性。RecD引入新型张量格式——逆键值锯齿张量(IKJTs)——用以去除每个批次中的特征值重复性。我们阐述了DLRM模型架构如何利用IKJTs显著提升训练吞吐量。在工业级DLRM训练系统中,RecD使训练吞吐量、预处理吞吐量及存储效率分别提升至2.48倍、1.79倍及3.71倍。