Modern Machine Learning (ML) training on large-scale datasets is a very time-consuming workload. It relies on the Stochastic Gradient Descent (SGD) optimization algorithm due to its effectiveness, simplicity, and generalization performance. Processor-centric architectures (e.g., CPUs, GPUs) commonly used for modern ML training workloads based on SGD are bottlenecked by data movement between the processor and memory units due to the poor data locality in accessing large datasets. As a result, processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads. Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck by placing computation mechanisms inside or near memory. Our goal is to understand the capabilities of popular distributed SGD algorithms on real-world PIM systems to accelerate data-intensive ML training workloads. To this end, we 1) implement several representative centralized parallel SGD algorithms on the real-world UPMEM PIM system, 2) rigorously evaluate these algorithms for ML training on large-scale datasets in terms of performance, accuracy, and scalability, 3) compare them to conventional CPU and GPU baselines, and 4) discuss implications for future PIM hardware, highlighting the need for a shift toward algorithm-hardware co-design. Our results demonstrate three major findings: 1) the UPMEM PIM system can be a viable alternative to state-of-the-art CPUs and GPUs for many memory-bound ML training workloads, especially when operations and datatypes are natively supported by PIM hardware; 2) it is important to carefully choose the optimization algorithm that best fits PIM; and 3) the UPMEM PIM system does not scale approximately linearly with the number of nodes for many data-intensive ML training workloads. We open-source all our code to facilitate future research.
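To make the centralized parallel SGD setting concrete, the following is a minimal sketch, not the paper's implementation: a synchronous, data-parallel minibatch SGD for logistic regression in which each simulated worker computes a gradient on its data shard and a central node averages the gradients and updates the shared model. The model, data, and hyperparameters are illustrative assumptions; on a real PIM system the per-shard gradient computations would run in parallel on the memory-side processing units.

```python
# Hedged sketch of centralized, synchronous data-parallel minibatch SGD.
# Each "worker" (here, a simulated data shard) computes a local gradient;
# a central node averages the gradients and applies the update.
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss on one data shard."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def centralized_parallel_sgd(X, y, n_workers=4, lr=0.5, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    # Partition the dataset across workers once, as in data-parallel training.
    shards = np.array_split(rng.permutation(len(y)), n_workers)
    for _ in range(epochs):
        # On real hardware these gradients are computed in parallel,
        # one per worker; here they are computed sequentially.
        grads = [logistic_grad(w, X[idx], y[idx]) for idx in shards]
        w -= lr * np.mean(grads, axis=0)  # central averaging + update
    return w

# Toy linearly separable problem: label is 1 when the first feature is positive.
rng = np.random.default_rng(42)
X = rng.normal(size=(512, 3))
y = (X[:, 0] > 0).astype(float)
w = centralized_parallel_sgd(X, y)
train_acc = np.mean(((X @ w) > 0) == (y > 0.5))
```

Because the workers synchronize every step, this variant is mathematically equivalent to single-node minibatch SGD with a larger batch; the asynchronous variants evaluated in the paper relax this synchronization at the cost of gradient staleness.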