Squire: A General-Purpose Accelerator to Exploit Fine-Grain Parallelism on Dependency-Bound Kernels

Many HPC applications are bottlenecked by compute-intensive kernels that implement complex dependency patterns, making them data-dependency bound. Traditional general-purpose accelerators struggle to exploit fine-grain parallelism in such kernels, either because they cannot express convoluted data-dependency patterns (as with SIMD units) or because of synchronization and data-transfer overheads (as with GPGPUs). In contrast, custom FPGA and ASIC designs offer better performance and energy efficiency, but at a high cost in hardware design and programming complexity, and they often lack the flexibility to process different workloads. We propose Squire, a general-purpose accelerator designed to exploit fine-grain parallelism effectively on dependency-bound kernels. Each Squire accelerator has a set of general-purpose, low-power, in-order cores that can rapidly communicate among themselves and directly access data from the L2 cache. Our proposal integrates one Squire accelerator per core in a typical multicore system, allowing dependency-bound kernels within parallel tasks to be accelerated with minimal software changes. As a case study, we evaluate Squire's effectiveness by accelerating five kernels that implement complex dependency patterns. We use three of these kernels to build an end-to-end read-mapping tool, which serves as the application-level benchmark. Squire obtains speedups of up to 7.64$\times$ in dynamic programming kernels and accelerates the end-to-end application by 3.66$\times$. In addition, Squire reduces energy consumption by up to 56% with a minimal area overhead of 10.5% compared to a Neoverse-N1 baseline.
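To make the notion of a dependency-bound kernel concrete, the following is a minimal illustrative sketch and not one of the kernels evaluated with Squire. It assumes a textbook edit-distance recurrence as a stand-in for a dynamic programming kernel, and the function name `edit_distance_wavefront` is hypothetical. Each cell depends on its left, upper, and upper-left neighbors, so only cells on the same anti-diagonal can be computed independently. That leaves parallelism that is fine-grained and requires a synchronization point after every diagonal.

```python
def edit_distance_wavefront(a: str, b: str) -> int:
    """Levenshtein distance, filled in anti-diagonal (wavefront) order.

    Illustrative only: a generic DP recurrence, not a Squire kernel.
    """
    n, m = len(a), len(b)
    # dp[i][j] holds the edit distance between a[:i] and b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all i characters of a
    for j in range(m + 1):
        dp[0][j] = j  # insert all j characters of b

    # Cells with i + j == d form one anti-diagonal. Each cell reads only
    # from diagonals d-1 and d-2, so cells on the same diagonal do not
    # depend on each other and could be assigned to different cores.
    for d in range(2, n + m + 1):
        lo = max(1, d - m)
        hi = min(n, d - 1)
        for i in range(lo, hi + 1):  # independent iterations within a diagonal
            j = d - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
        # A parallel version would need a barrier here before diagonal d+1.
    return dp[n][m]
```

The inner loop carries very little work per cell, which is why the cost of synchronizing and exchanging data between workers dominates when such a kernel is parallelized.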
