Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved only for DLRM training, driving new interest in cost- and time-saving optimizations. The systems challenges faced in this setting are unique: while typical deep learning training jobs are dominated by model execution, the most important factor in DLRM training performance is often online data ingestion. In this paper, we explore the unique characteristics of this data ingestion problem and provide insights into DLRM training pipeline bottlenecks and challenges. We study real-world DLRM data processing pipelines taken from our compute cluster at Netflix to observe the performance impacts of online ingestion and to identify shortfalls in existing pipeline optimizers. We find that current tooling yields sub-optimal performance, crashes frequently, or requires impractical cluster re-organization to adopt. Our studies lead us to design and build a new solution for data pipeline optimization, InTune. InTune employs a reinforcement learning (RL) agent to learn how to distribute the CPU resources of a trainer machine across a DLRM data pipeline to more effectively parallelize data loading and improve throughput. Our experiments show that InTune can build an optimized data pipeline configuration within only a few minutes and can easily be integrated into existing training workflows. By exploiting the responsiveness and adaptability of RL, InTune achieves higher online data ingestion rates than existing optimizers, thus reducing idle times in model execution and increasing efficiency. We apply InTune to our real-world cluster and find that it increases data ingestion throughput by as much as 2.29X versus state-of-the-art data pipeline optimizers while also improving both CPU and GPU utilization.