Matrix libraries often focus on achieving high performance for problems considered to be either "small" or "large", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations such as general matrix multiplication (GEMM) that achieves high performance for both small and large problem sizes. The key is to fuse packing, an operation that copies data into a contiguous layout in memory and is critical for large-matrix performance, with the first computational "pass" over that data. This fusion boosts performance across the spectrum of problem sizes. It also simplifies the tuning of general-purpose libraries, because it obviates the need to carefully express and parameterize logic that chooses between a "small matrix" strategy and a "large matrix" strategy. We describe a prototype implementation of the technique built with the BLAS-like Library Instantiation Software (BLIS) framework and report its performance on a range of architectures.
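The idea of fusing packing with the first computational pass can be illustrated with a minimal sketch in C. This is not the BLIS implementation and uses no BLIS API: the names `gemm_fused_pack`, `gemm_reference`, `NR`, and `Bp` are hypothetical, only panels of B are packed, and the "first pass" is assumed to be the first row of A. A real implementation would also block for the cache hierarchy and call an optimized microkernel.

```c
#include <assert.h>
#include <stdlib.h>

enum { NR = 4 };  /* hypothetical panel width (columns of B packed together) */

/* Reference: C += A * B, with A (m x k), B (k x n), C (m x n), all row-major. */
void gemm_reference(int m, int n, int k,
                    const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}

/* C += A * B, packing each column panel of B while it is first used.
 * Returns 0 on success, -1 if the packing buffer cannot be allocated. */
int gemm_fused_pack(int m, int n, int k,
                    const double *A, const double *B, double *C)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    if (m == 0 || n == 0 || k == 0)
        return 0;

    /* Contiguous buffer holding one k x NR panel of B. */
    double *Bp = malloc((size_t)k * NR * sizeof *Bp);
    if (Bp == NULL)
        return -1;

    for (int j0 = 0; j0 < n; j0 += NR) {
        int nr = (n - j0 < NR) ? (n - j0) : NR;  /* edge panel may be narrower */

        /* First pass (row 0 of A): read B from its original strided layout,
         * store each element into the packed panel, and use it immediately. */
        for (int p = 0; p < k; p++) {
            double a = A[p];
            for (int j = 0; j < nr; j++) {
                double b = B[p * n + j0 + j];
                Bp[p * NR + j] = b;       /* packing */
                C[j0 + j] += a * b;       /* computation, fused with packing */
            }
        }

        /* Remaining passes (rows 1..m-1 of A): reuse the packed panel. */
        for (int i = 1; i < m; i++) {
            for (int p = 0; p < k; p++) {
                double a = A[i * k + p];
                for (int j = 0; j < nr; j++)
                    C[i * n + j0 + j] += a * Bp[p * NR + j];
            }
        }
    }

    free(Bp);
    return 0;
}
```

In this sketch a problem with a single row of A pays no separate packing traversal, while a problem with many rows still reuses a contiguous panel after the first pass. That is the sense in which one code path is meant to serve both small and large problem sizes.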