For years, SIMD/vector units have enhanced the capabilities of modern CPUs in High-Performance Computing (HPC) and mobile technology. Typical commercially available SIMD units process up to 8 double-precision elements with one instruction. The optimal vector width, and how it interacts with memory latency and bandwidth to affect CPU throughput, remains a challenging research area. This study examines the behavior of four computational kernels on a RISC-V core connected to a customizable vector unit capable of operating on up to 256 double-precision elements per instruction. The four codes have been purposefully selected to represent non-dense workloads: SpMV, BFS, PageRank, and FFT. The experimental setup allows us to measure their performance while varying the vector length, the memory latency, and the memory bandwidth. Our results not only show that larger vector lengths allow for better tolerance of limitations in the memory subsystem but also offer hope to developers of codes beyond dense linear algebra.
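To make the role of vector length in a non-dense kernel concrete, the following is a minimal sketch of SpMV over a matrix in CSR format, with the inner loop strip-mined into chunks of at most `max_vl` elements. It is a scalar Python model for illustration only, not the implementation evaluated in the study; the function name `spmv_csr_stripmined`, the parameter `max_vl`, and the `vector_ops` counter are hypothetical names introduced here.

```python
def spmv_csr_stripmined(row_ptr, col_idx, values, x, max_vl):
    """Compute y = A @ x for a CSR matrix A.

    Each row's nonzeros are processed in strips of at most max_vl
    elements, modelling one vector instruction sequence per strip.
    Returns (y, vector_ops), where vector_ops counts the strips.
    All names here are illustrative assumptions, not from the study.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    vector_ops = 0
    for row in range(n_rows):
        start, end = row_ptr[row], row_ptr[row + 1]
        acc = 0.0
        pos = start
        while pos < end:
            # Active length for this strip: the remaining nonzeros,
            # capped at the maximum vector length.
            vl = min(max_vl, end - pos)
            # Indexed load of x (gather), multiply, and reduce over the strip.
            acc += sum(values[k] * x[col_idx[k]] for k in range(pos, pos + vl))
            vector_ops += 1
            pos += vl
        y[row] = acc
    return y, vector_ops


# Small example: A = [[1, 0, 2], [0, 3, 0], [4, 5, 6]], x = [1, 1, 1]
ROW_PTR = [0, 2, 3, 6]
COL_IDX = [0, 2, 1, 0, 1, 2]
VALUES = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X = [1.0, 1.0, 1.0]

y_short, ops_short = spmv_csr_stripmined(ROW_PTR, COL_IDX, VALUES, X, max_vl=2)
y_long, ops_long = spmv_csr_stripmined(ROW_PTR, COL_IDX, VALUES, X, max_vl=256)
```

In this model a larger `max_vl` reduces the number of strips only for rows with more than `max_vl` nonzeros, which is one reason sparse kernels are a less obvious fit for long vectors than dense linear algebra.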