Towards a Benchmarking Suite for Kernel Tuners

As computing systems become more complex, it is becoming harder for programmers to keep their code optimized as the hardware gets updated. Autotuners try to alleviate this by hiding as many architecture-specific optimization details as possible from the user, so that the code can be used efficiently across different generations of systems. In this article we introduce a new benchmark suite for evaluating the performance of optimization algorithms used by modern autotuners targeting GPUs. The suite contains tunable GPU kernels that are representative of real-world applications, allowing for comparisons between optimization algorithms and the examination of code optimization, search space difficulty, and performance portability. Our framework facilitates easy integration of new autotuners and benchmarks by defining a shared problem interface. We evaluate the benchmark suite on five characteristics: convergence rate, local minima centrality, optimal speedup, Permutation Feature Importance (PFI), and performance portability. The results show that optimization parameters greatly impact performance and that global optimization is needed. The importance of each parameter is consistent across GPU architectures; however, the specific values need to be optimized for each architecture. Our portability study highlights the crucial importance of autotuning each application for a specific target architecture. Simply transferring the optimal configuration from one architecture to another yields between 58.5% and 99.9% of the optimal performance, depending on the GPU architecture. This highlights the importance of autotuning in modern computing systems and the value of our benchmark suite in facilitating the study of optimization algorithms and their effectiveness in achieving optimal performance for specific target architectures.
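To make the portability figure concrete, the following minimal sketch shows one way such a ratio could be computed from measured kernel runtimes. It is an illustration under stated assumptions, not the suite's actual implementation: the function name `relative_performance`, the GPU labels, the configuration tuples, and all runtimes are hypothetical, and performance is assumed to be the inverse of runtime.

```python
from typing import Dict, Hashable

# Hypothetical layout: runtimes[gpu][config] = measured kernel time in ms.
Runtimes = Dict[str, Dict[Hashable, float]]


def relative_performance(runtimes: Runtimes, source: str, target: str) -> float:
    """Fraction of the target GPU's optimal performance that is reached
    when reusing the configuration that was fastest on the source GPU.

    Assumes performance is 1 / runtime, so the fraction equals
    (best runtime on target) / (runtime of transferred config on target).
    Raises KeyError if the transferred config was not measured on target.
    """
    source_times = runtimes[source]
    target_times = runtimes[target]
    # Configuration with the lowest runtime on the source GPU.
    best_on_source = min(source_times, key=source_times.get)
    # Lowest runtime achievable on the target GPU over all measured configs.
    best_target_time = min(target_times.values())
    return best_target_time / target_times[best_on_source]


# Made-up measurements for two hypothetical GPUs and three configurations,
# each configuration being a (block_size, tile_size) tuple.
example_runtimes: Runtimes = {
    "gpu_a": {(32, 1): 1.0, (64, 2): 1.5, (128, 4): 2.0},
    "gpu_b": {(32, 1): 2.0, (64, 2): 1.0, (128, 4): 1.6},
}

# The best config on gpu_a is (32, 1); on gpu_b it runs in 2.0 ms
# against an optimum of 1.0 ms, so only 50% of optimal is reached.
a_to_b = relative_performance(example_runtimes, "gpu_a", "gpu_b")
```

A value of 1.0 means the transferred configuration is also optimal on the target; lower values quantify how much performance is lost by skipping tuning on the target architecture.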
