We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactions. While each user session can generate multiple training samples, many features' values do not change across these samples. We demonstrate how RecD exploits this property, end-to-end, across a deployed training pipeline. RecD optimizes data generation pipelines to decrease dataset storage and preprocessing resource demands and to maximize duplication within a training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors (IKJTs), to deduplicate feature values in each batch. We show how DLRM model architectures can leverage IKJTs to drastically increase training throughput. RecD improves the training and preprocessing throughput and storage efficiency by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
翻译:本文提出RecD(推荐系统去重技术),一套跨越深度学习推荐模型(DLRM)训练管道的端到端基础设施优化方案。RecD旨在解决工业级DLRM训练数据集中固有特征重复所导致的存储、预处理和训练巨大开销问题。特征重复源于DLRM数据集通过交互行为生成:虽然每个用户会话可产生多个训练样本,但许多特征值在这些样本中保持不变。我们展示了RecD如何利用这一特性,在已部署的训练管道中实现端到端优化。RecD通过优化数据生成管道来降低数据集存储和预处理资源需求,同时最大化训练批次内的重复率。RecD引入新型张量格式——逆关键连接张量(IKJTs),用于对每个批次中的特征值进行去重。我们阐述了DLRM模型架构如何利用IKJTs显著提升训练吞吐量。在工业级DLRM训练系统中,RecD分别将训练吞吐量、预处理吞吐量和存储效率提升了最高2.48倍、1.79倍和3.71倍。