We formally introduce a systematic (de/re)-composition approach based on the algebraic formalism of "Multi-Dimensional Homomorphisms (MDHs)". Our approach is designed to be general enough to apply to a wide range of data-parallel computations and to various kinds of target parallel architectures. To efficiently target the deep and complex memory and core hierarchies of contemporary architectures, we use our (de/re)-composition approach to derive a correct-by-construction, parametrized cache blocking and parallelization strategy. We show that our approach is powerful enough to express, in the same formalism, the (de/re)-composition strategies of different classes of state-of-the-art approaches (scheduling-based, polyhedral, etc.), and we demonstrate that the parameters of our strategies enable systematically generating code that can be fully automatically optimized (auto-tuned) for the particular target architecture and for the characteristics of the input and output data (e.g., their sizes and memory layouts). In particular, our experiments confirm that, via auto-tuning, we achieve higher performance than state-of-the-art approaches, including hand-optimized solutions provided by vendors (such as NVIDIA cuBLAS/cuDNN and Intel oneMKL/oneDNN), on real-world data sets and for a variety of data-parallel computations, including linear algebra routines, stencil and quantum chemistry computations, data mining algorithms, and computations that have recently gained high attention due to their relevance for deep learning.
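To make the idea of (de/re)-composition with tunable parameters concrete, the following is a minimal sketch, not the formalism or implementation of this work. It assumes matrix-vector multiplication as the example computation, with parts combined by concatenation along the row dimension `i` and by pointwise addition along the reduction dimension `k`. The names `matvec`, `split`, `matvec_tiled`, `tile_i`, and `tile_k` are hypothetical and chosen only for illustration.

```python
def matvec(M, v):
    """Reference computation: w[i] = sum over k of M[i][k] * v[k]."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def split(n, tile):
    """Half-open index ranges covering range(n) in chunks of at most `tile`."""
    return [(lo, min(lo + tile, n)) for lo in range(0, n, tile)]


def matvec_tiled(M, v, tile_i, tile_k):
    """Decompose the input into (tile_i x tile_k) blocks, compute each block
    with the reference function, and recompose the partial results.

    Assumption for this sketch: partial results are recombined by
    concatenation along i and by pointwise addition along k.
    """
    out = []
    for i_lo, i_hi in split(len(M), tile_i):
        acc = [0] * (i_hi - i_lo)
        for k_lo, k_hi in split(len(v), tile_k):
            # De-composition: select one block of the matrix and vector.
            sub_M = [row[k_lo:k_hi] for row in M[i_lo:i_hi]]
            sub_v = v[k_lo:k_hi]
            part = matvec(sub_M, sub_v)
            # Re-composition along k: pointwise addition of partial sums.
            acc = [a + p for a, p in zip(acc, part)]
        # Re-composition along i: concatenation of row blocks.
        out.extend(acc)
    return out
```

For integer inputs, every choice of `tile_i` and `tile_k` yields the same result as `matvec`, which is the sense in which such tile sizes can be treated as free tuning parameters; with floating-point inputs, reordering the additions along `k` may change the result by rounding error.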