Coarse-Grained Reconfigurable Arrays (CGRAs) are promising edge accelerators because they offer an outstanding balance of flexibility, performance, and energy efficiency. Classic CGRAs statically map compute operations onto processing elements (PEs) and route the data dependencies among those operations through the Network-on-Chip. However, CGRAs are designed for fine-grained, static instruction-level parallelism and struggle to accelerate applications with dynamic and irregular data-level parallelism, such as graph processing. To address this limitation, we present Flip, a novel accelerator that enhances traditional CGRA architectures to boost the performance of graph applications. Flip retains the classic CGRA execution model while introducing a special data-centric mode for efficient graph processing. Specifically, it exploits the natural data parallelism of graph algorithms by mapping graph vertices, rather than operations, onto PEs and by dynamically routing temporary data as the graph frontier evolves at runtime. Experimental results demonstrate that Flip achieves up to 36$\times$ speedup with merely 19% more area compared to classic CGRAs. Compared to state-of-the-art large-scale graph processors, Flip has similar energy efficiency and 2.2$\times$ better area efficiency at a much-reduced power/area budget.
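As a rough illustration of the data-centric idea, the following software sketch assigns each graph vertex to an owning PE and records which PE-to-PE messages a breadth-first search generates at each step. It is not Flip's hardware mechanism: the modulo placement, the function names `place_vertices` and `data_centric_bfs`, and the small example graph are all assumptions made for illustration only.

```python
from collections import defaultdict


def place_vertices(num_vertices, num_pes):
    """Hypothetical placement: vertex v is owned by PE (v mod num_pes)."""
    return [v % num_pes for v in range(num_vertices)]


def data_centric_bfs(adj, source, num_pes):
    """BFS where each vertex lives on a PE; returns distances and per-step traffic.

    traffic[i] maps (src_pe, dst_pe) to the number of messages sent in step i,
    so the communication pattern depends on the frontier at runtime.
    """
    owner = place_vertices(len(adj), num_pes)
    dist = [None] * len(adj)
    dist[source] = 0
    frontier = [source]
    traffic = []
    while frontier:
        step_traffic = defaultdict(int)
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                # Every traversed edge becomes a message to the PE that owns v.
                step_traffic[(owner[u], owner[v])] += 1
                if dist[v] is None:
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        traffic.append(dict(step_traffic))
        frontier = next_frontier
    return dist, traffic


# Assumed example: a 5-vertex directed graph on 4 hypothetical PEs.
adj = [[1, 2], [3], [3], [4], []]
dist, traffic = data_centric_bfs(adj, source=0, num_pes=4)
print(dist)
print(traffic)
```

The set of active PE pairs changes from step to step, which is the kind of traffic a statically configured route cannot anticipate; a static operation-to-PE mapping would instead fix the communication pattern before execution.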