The performance gap between memory and processors has grown rapidly. Consequently, the energy and wall-clock time spent moving data between the CPU and main memory dominate the overall computational cost. The Processing-in-Memory (PIM) paradigm is a promising architecture that reduces the need for extensive data movement by placing computing units close to the memory. Despite abundant efforts devoted to building robust and highly available PIM systems, identifying the PIM-friendly segments of an application remains challenging because no comprehensive tool exists to evaluate the intrinsic memory access pattern of each segment. To tackle this challenge, we propose A$^3$PIM: an Automated, Analytic and Accurate Processing-in-Memory offloader. Using a static code analyzer, we systematically consider both the cross-segment data movement and the intrinsic memory access pattern of each code segment. We evaluate A$^3$PIM across a wide range of real-world workloads, including the GAP and PrIM benchmarks, and achieve an average speedup of 2.63x and 4.45x (up to 7.14x and 10.64x) compared to CPU-only and PIM-only executions, respectively.