For years, SIMD/vector units have enhanced the capabilities of modern CPUs in High-Performance Computing (HPC) and mobile technology. Typical commercially available SIMD units process up to 8 double-precision elements with one instruction. The optimal vector width, and how it affects CPU throughput under given memory latency and bandwidth, remain challenging research questions. This study examines the behavior of four computational kernels on a RISC-V core connected to a customizable vector unit capable of operating on up to 256 double-precision elements per instruction. The four codes were purposefully selected to represent non-dense workloads: SpMV, BFS, PageRank, and FFT. The experimental setup allows us to measure their performance while varying the vector length, the memory latency, and the memory bandwidth. Our results show that larger vector lengths tolerate limitations in the memory subsystem better, and they offer hope to developers of codes beyond dense linear algebra.