The paper presents a unified co-design of 1) a programming and execution model that allows tasks to be spawned from within the vertex data at runtime; 2) language constructs for \textit{actions} that send work to where the data resides, combining the parallel expressiveness of local control objects (LCOs) to implement asynchronous graph processing primitives; and 3) an innovative vertex-centric data structure, based on the concept of Rhizomes, that parallelizes both the out-degree and in-degree load of vertex objects across many cores while still providing a single programming abstraction for the vertex objects. The data structure parallelizes the out-degree load of vertices hierarchically and the in-degree load laterally. The rhizomes communicate internally and remain consistent through event-driven synchronization mechanisms, providing a unified and correct view of the vertex. Simulated experimental results show performance gains for BFS, SSSP, and PageRank on large chip sizes for the tested input graph datasets, which have highly skewed degree distributions. The improvements come from the ability to express and create fine-grain dynamic computing tasks in the form of \textit{actions}, from language constructs that help the compiler generate code that the runtime system uses to schedule tasks optimally, and from the data structure that shares both in-degree and out-degree compute workload among memory-processing elements.
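The split described in the abstract, hierarchical for out-degree load and lateral for in-degree load, can be illustrated with a small sequential sketch. This is not the paper's implementation: the names `OutNode`, `build_out_tree`, `collect_edges`, `Rhizome`, and `RhizomeVertex`, the parameters `cap` and `fanout`, the round-robin assignment of in-edges, and the min-style update are all assumptions made for illustration. The event-driven synchronization between rhizomes is replaced here by a plain loop over sibling replicas.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OutNode:
    """Hypothetical node of an out-edge hierarchy; stores a bounded slice of edges."""
    edges: List[int] = field(default_factory=list)
    children: List["OutNode"] = field(default_factory=list)


def build_out_tree(targets: List[int], cap: int, fanout: int) -> OutNode:
    """Keep up to `cap` out-edges locally and push the rest into at most `fanout` children."""
    if cap < 1 or fanout < 1:
        raise ValueError("cap and fanout must be positive")
    node = OutNode(edges=list(targets[:cap]))
    rest = targets[cap:]
    if rest:
        size = -(-len(rest) // fanout)  # ceiling division: edges per child subtree
        for start in range(0, len(rest), size):
            node.children.append(build_out_tree(rest[start:start + size], cap, fanout))
    return node


def collect_edges(node: OutNode) -> List[int]:
    """Gather every out-edge stored anywhere in the hierarchy."""
    found = list(node.edges)
    for child in node.children:
        found.extend(collect_edges(child))
    return found


@dataclass
class Rhizome:
    """Hypothetical replica of a vertex that owns a share of its in-edges."""
    in_count: int = 0
    value: float = float("inf")  # local copy of the vertex state, e.g. a BFS level


class RhizomeVertex:
    """One logical vertex whose in-degree load is spread laterally over replicas."""

    def __init__(self, n_rhizomes: int) -> None:
        self.rhizomes = [Rhizome() for _ in range(n_rhizomes)]
        self._next = 0

    def assign_in_edge(self) -> int:
        """Assign an incoming edge to a replica in round-robin order (an assumption)."""
        idx = self._next
        self.rhizomes[idx].in_count += 1
        self._next = (idx + 1) % len(self.rhizomes)
        return idx

    def relax(self, idx: int, candidate: float) -> bool:
        """Apply a min-update arriving at replica `idx` and share it with its siblings."""
        if candidate < self.rhizomes[idx].value:
            # Stand-in for event-driven synchronization: update all replicas in a loop.
            for replica in self.rhizomes:
                replica.value = min(replica.value, candidate)
            return True
        return False
```

In this sketch, a vertex with many out-edges becomes a tree whose nodes could each live on a different core, while a vertex with many in-edges receives updates at whichever replica owns the incoming edge. Every replica holds the same value after each `relax` call, which models the single consistent view of the vertex.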