ALPHA-PIM: Analysis of Linear Algebraic Processing for High-Performance Graph Applications on a Real Processing-In-Memory System

Processing large-scale graph datasets is computationally intensive and time-consuming. Processor-centric CPU and GPU architectures, commonly used for graph applications, often suffer from extensive data movement between the processor and memory because graph workloads exhibit low data reuse. As a result, these applications are typically memory-bound, which limits both performance and energy efficiency. Processing-In-Memory (PIM) offers a promising way to mitigate data movement bottlenecks by integrating computation directly within or near memory. Although several previous studies have proposed custom PIM designs for graph processing, they do not leverage real-world PIM systems. This work explores the behavior of common graph algorithms on a real-world PIM system, with the goal of accelerating data-intensive graph workloads. To this end, we (1) implement representative graph algorithms on UPMEM's general-purpose PIM architecture; (2) characterize their performance and identify key bottlenecks; (3) compare results against CPU and GPU baselines; and (4) derive insights to guide future PIM hardware design. Our study underscores the importance of selecting an appropriate strategy for partitioning data across PIM cores to maximize performance. Additionally, we identify critical hardware limitations in current PIM architectures and emphasize the need for future enhancements across the computation, memory, and communication subsystems. Key opportunities for improvement include increasing instruction-level parallelism, developing improved DMA engines with non-blocking capabilities, and enabling direct interconnection networks among PIM cores to reduce data transfer overheads.
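
To make the idea of linear algebraic graph processing with data partitioned across PIM cores more concrete, the following is a minimal sketch, not the implementation evaluated in this work. It expresses breadth-first search as a repeated Boolean matrix-vector product and assigns contiguous row blocks of the adjacency structure to simulated cores. The function names (`partition_rows`, `bfs_levels`), the contiguous row-wise partitioning, and the sequential loop standing in for parallel cores are all illustrative assumptions; no UPMEM SDK call is used.

```python
from typing import List


def partition_rows(num_vertices: int, num_cores: int) -> List[range]:
    """Split vertex ids into contiguous, near-equal row blocks (one per core).

    Hypothetical helper: a simple 1D row-wise partitioning, chosen only
    for illustration.
    """
    base, extra = divmod(num_vertices, num_cores)
    blocks, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def bfs_levels(in_neighbors: List[List[int]], source: int, num_cores: int) -> List[int]:
    """BFS as repeated Boolean matrix-vector products (OR over AND).

    in_neighbors[v] lists every u with an edge u -> v, i.e. the nonzeros
    of row v of the transposed adjacency matrix. Returns the BFS level of
    each vertex, or -1 if it is unreachable from the source.
    """
    n = len(in_neighbors)
    blocks = partition_rows(n, num_cores)
    level = [-1] * n
    level[source] = 0
    frontier = [False] * n
    frontier[source] = True
    depth = 0
    while any(frontier):
        depth += 1
        next_frontier = [False] * n
        # Assumption: each row block would run on its own core; here the
        # blocks are simply processed one after another.
        for rows in blocks:
            for v in rows:
                # Row v times the frontier vector, masked by "not yet visited".
                if level[v] == -1 and any(frontier[u] for u in in_neighbors[v]):
                    next_frontier[v] = True
        # Gather step: merge per-block results and record the new levels.
        for v in range(n):
            if next_frontier[v]:
                level[v] = depth
        frontier = next_frontier
    return level


if __name__ == "__main__":
    # Edges: 0->1, 0->2, 1->3, 2->3, 3->4; vertex 5 is isolated.
    graph = [[], [0], [0], [1, 2], [3], []]
    print(bfs_levels(graph, source=0, num_cores=2))
```

In this sketch every block reads the full frontier vector in each iteration, so the frontier has to be redistributed to all cores between iterations. That per-iteration exchange is one reason the choice of partitioning and the cost of inter-core data transfer matter for this class of workloads.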
