An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System

Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., systems with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, and K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance, and scaling, and (3) compare them to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is $27\times$ faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and $1.34\times$ faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is $2.8\times$ and $3.2\times$ faster than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers and architects of future memory-centric computing systems.
