Retrospective: A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Our ISCA 2015 paper presents a new programmable processing-in-memory (PIM) architecture and system design that can accelerate key data-intensive applications, with a focus on graph processing workloads. Our major idea was to completely rethink the system, including the programming model, data partitioning mechanisms, system support, and instruction set architecture, along with the near-memory execution units and their communication architecture, so that an important workload can be accelerated as much as possible using a distributed system of well-connected near-memory accelerators. We built our accelerator system, Tesseract, using 3D-stacked memories with logic layers, where each logic layer contains general-purpose processing cores, and the cores communicate with each other using a message-passing programming model. The cores could be specialized for graph processing (or for any other application to be accelerated). To our knowledge, our paper was the first to design a complete near-memory accelerator system from scratch such that it is both generally programmable and specifically customizable to accelerate important applications, with a case study on major graph processing workloads. Ensuing work in academia and industry showed that similar approaches to system design can greatly benefit both graph processing workloads and other applications, such as machine learning, for which ideas from Tesseract seem to have been influential. This short retrospective provides a brief analysis of our ISCA 2015 paper and its impact. We briefly describe the major ideas and contributions of the work, discuss later works that built on it or were influenced by it, and make some educated guesses about what the future may bring for PIM and accelerator systems.
