This paper presents a new theory of locality and its compiler support. The theory is fully symbolic and derives locality as polynomials, and the compiler analysis supports affine loop nests. Together, they derive cache-performance scaling as quadratic and reciprocal expressions, and are more general and more precise than empirical scaling rules. Evaluated on a benchmark suite of 41 scientific kernels and tensor operations, the compiler takes an average of 41 seconds to derive the locality polynomials. After derivation, predicting the cache miss count for any given input size and cache configuration takes less than a millisecond. Across all tests, with and without loop fusion, the accuracy of the data-movement prediction is 99.6\%, measured against a simulated set-associative L1 data cache.