Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications

Applications with low data reuse and frequent irregular memory accesses, such as graph or sparse linear algebra workloads, fail to scale well due to memory bottlenecks and poor core utilization. While prior work on prefetching, decoupling, or pipelining can mitigate memory latency and improve core utilization, memory bottlenecks persist due to limited off-chip bandwidth. Processing-in-memory (PIM) approaches based on the Hybrid Memory Cube (HMC) overcome bandwidth limitations but fail to achieve high core utilization due to poor task scheduling and synchronization overheads. Moreover, the high memory-per-core ratio available with HMC limits strong scaling. We introduce Dalorex, a hardware-software co-design that achieves high parallelism and energy efficiency, demonstrating strong scaling with >16,000 cores when processing graph and sparse linear algebra workloads. Compared with prior work in PIM, with both using 256 cores, Dalorex improves performance and reduces energy consumption by two orders of magnitude through (1) a tile-based distributed-memory architecture where each processing tile holds an equal amount of data, and all memory operations are local; (2) a task-based parallel programming model where tasks are executed by the processing unit that is co-located with the target data; (3) a network design optimized for irregular traffic, where all communication is one-way, and messages do not contain routing metadata; (4) novel traffic-aware task scheduling hardware that maintains high core utilization; and (5) a data placement strategy that improves work balance. This work proposes architectural and software innovations to provide the greatest scalability to date for running graph algorithms while still being programmable for other domains.
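
To make points (1) and (2) concrete, the following is a minimal software sketch of data-local task execution, not the Dalorex hardware or its actual programming interface. It assumes that vertices are split into equal contiguous chunks, that the owning tile is computed from the vertex index alone, and that each vertex's adjacency list is stored on the same tile as the vertex. The names `Tile`, `owner_of`, and `bfs_data_local` are hypothetical and chosen only for illustration. Each task is a one-way message `(vertex, distance)` placed in the queue of the tile that owns the vertex, so every read and write touches only that tile's local data.

```python
from collections import deque


class Tile:
    """One processing tile: owns vertices [lo, hi) and a local task queue."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.dist = {v: None for v in range(lo, hi)}  # local vertex data
        self.adj = {v: [] for v in range(lo, hi)}     # local adjacency lists
        self.queue = deque()                          # incoming tasks


def owner_of(vertex, chunk):
    """Assumed placement: equal contiguous chunks, owner derived from the index."""
    return vertex // chunk


def bfs_data_local(edges, num_vertices, num_tiles, root):
    """Breadth-first distances computed by tasks that run where the data lives."""
    chunk = -(-num_vertices // num_tiles)  # ceiling division
    tiles = [Tile(t * chunk, min((t + 1) * chunk, num_vertices))
             for t in range(num_tiles)]

    # Distribute the graph: each edge list is stored with its source vertex.
    for src, dst in edges:
        tiles[owner_of(src, chunk)].adj[src].append(dst)

    # Seed the computation with a single task at the root's owner.
    tiles[owner_of(root, chunk)].queue.append((root, 0))

    # Round-robin over tiles; each tile runs one queued task per turn.
    while any(t.queue for t in tiles):
        for tile in tiles:
            if not tile.queue:
                continue
            v, d = tile.queue.popleft()
            assert tile.lo <= v < tile.hi  # the task only touches local data
            if tile.dist[v] is not None and tile.dist[v] <= d:
                continue  # an equal or shorter distance is already recorded
            tile.dist[v] = d
            for n in tile.adj[v]:
                # One-way message: no reply is ever sent back to this tile.
                tiles[owner_of(n, chunk)].queue.append((n, d + 1))

    # Gather results only for inspection; the computation itself never does this.
    return [tiles[owner_of(v, chunk)].dist[v] for v in range(num_vertices)]


if __name__ == "__main__":
    demo_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    print(bfs_data_local(demo_edges, num_vertices=6, num_tiles=3, root=0))
    # prints [0, 1, 1, 2, 3, None]
```

Because a task may arrive after a shorter distance has already been recorded, the sketch relaxes distances rather than relying on level-by-level synchronization. That choice is part of this illustration and is not a statement about how Dalorex schedules tasks.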
