Sparse matrix-vector multiplication (SpMV) is central to numerous data-intensive applications, but it requires streaming indirect memory accesses that severely degrade both processing and memory throughput in state-of-the-art architectures. Near-memory hardware units, which decouple indirect streams from processing elements, partially alleviate the bottleneck, but they rely on low DRAM access granularity, which is highly inefficient for modern DRAM standards like HBM and LPDDR. To fully address the end-to-end challenge, we propose a low-overhead data coalescer combined with a near-memory indirect streaming unit for AXI-Pack, an extension to the widespread AXI4 protocol that packs narrow irregular stream elements onto wide memory buses. Our combined solution leverages the memory-level parallelism and coalescence of streaming indirect accesses in irregular applications like SpMV to maximize the performance and bandwidth efficiency attained on wide memory interfaces. It delivers an average speedup of 8x in effective indirect access, often reaching the full memory bandwidth. As a result, we achieve an average end-to-end speedup of 3x on SpMV. Moreover, our approach is highly efficient on chip, requiring merely 27 kB of on-chip storage and a compact implementation area of 0.2-0.3 mm^2 in a 12 nm node.
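To make the access pattern concrete, the following is a minimal software sketch, not the proposed hardware. It shows SpMV over a matrix in compressed sparse row (CSR) form, where each read of `x[col_idx[j]]` is an indirect access, together with a toy count of how many wide memory accesses the gather needs with and without coalescing. The function names, the block size `elems_per_block`, and the grouping parameter `window` are assumptions chosen for illustration; they do not describe the actual coalescer or AXI-Pack.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for j in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access: the address into x depends on col_idx[j].
            acc += values[j] * x[col_idx[j]]
        y[row] = acc
    return y


def count_wide_accesses(col_idx, elems_per_block=8, window=16):
    """Count wide accesses for the gather, without and with coalescing.

    Illustrative assumptions: one wide access fetches a block of
    `elems_per_block` consecutive elements of x, and requests are merged
    only within fixed, non-overlapping groups of `window` indices.
    """
    uncoalesced = len(col_idx)  # one wide access per narrow element
    coalesced = 0
    for start in range(0, len(col_idx), window):
        group = col_idx[start:start + window]
        # Indices that fall into the same block share one wide access.
        coalesced += len({idx // elems_per_block for idx in group})
    return uncoalesced, coalesced


# Hypothetical 3x4 example matrix:
# [[1, 0, 2, 0],
#  [0, 3, 0, 0],
#  [4, 0, 5, 6]]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 2, 3]
values = [1, 2, 3, 4, 5, 6]
x = [1, 2, 3, 4]

y = spmv_csr(row_ptr, col_idx, values, x)
accesses = count_wide_accesses(col_idx, elems_per_block=2, window=3)
```

In this toy model, the gap between the two counts grows as more indices in a group fall into the same wide block, which is the locality a coalescer would exploit.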