Important memory-bound kernels, such as linear algebra, convolutions, and stencils, rely on SIMD instructions as well as optimizations targeting improved vectorized data traversal and data re-use to attain satisfactory performance. On contemporary CPU architectures, the hardware prefetcher is of key importance for efficient utilization of the memory hierarchy. In this paper, we demonstrate that transforming a memory access pattern consisting of a single stride into one that concurrently accesses multiple strides can boost the utilization of the hardware prefetcher and, in turn, significantly improve the performance of memory-bound kernels. Using a set of micro-benchmarks, we establish that accessing memory in a multi-strided manner enables more cache lines to be brought into the cache concurrently, resulting in improved cache hit ratios and higher effective memory bandwidth without the introduction of costly software prefetch instructions. Subsequently, we show that multi-strided variants of a collection of six memory-bound dense compute kernels outperform state-of-the-art counterparts on three different micro-architectures. More specifically, for kernels among which Matrix Vector Multiplication, Convolution Stencil, and kernels from PolyBench, we achieve significant speedups of up to 12.55x over Polly, 2.99x over MKL, 1.98x over OpenBLAS, 1.08x over Halide, and 1.87x over OpenCV. The code transformation that takes advantage of multi-strided memory access is a natural extension of the loop unroll and loop interchange techniques, allowing this method to be incorporated into compiler pipelines in the future.
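The transformation described above can be illustrated with a minimal sketch, not taken from the paper itself: a single-strided array reduction is strip-mined into `NSTRIDES` equal chunks, and the chunk loop is interchanged inward so that one element of every chunk is touched per outer iteration. This keeps `NSTRIDES` independent access streams live at once, the pattern the abstract credits with engaging more hardware-prefetcher streams. The function names and the choice of `NSTRIDES = 4` are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NSTRIDES 4  /* illustrative choice; tuned per micro-architecture */

/* Baseline: one sequential access stream. */
double sum_single_stride(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Multi-strided variant: the array is split into NSTRIDES equal chunks
   and each outer iteration advances one element in every chunk, so
   NSTRIDES concurrent strided streams are presented to the prefetcher.
   For brevity this sketch assumes n is divisible by NSTRIDES. */
double sum_multi_stride(const double *a, size_t n) {
    size_t chunk = n / NSTRIDES;
    double s[NSTRIDES] = {0.0};
    for (size_t i = 0; i < chunk; ++i)
        for (int k = 0; k < NSTRIDES; ++k)  /* inner loop: one step per stream */
            s[k] += a[(size_t)k * chunk + i];
    double total = 0.0;
    for (int k = 0; k < NSTRIDES; ++k)
        total += s[k];
    return total;
}
```

Both functions compute the same reduction; only the traversal order differs, which is why the transformation composes naturally with loop unrolling and interchange in a compiler pipeline.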