Efficient Construction of Large Search Spaces for Auto-Tuning

Automatic performance tuning, or auto-tuning, accelerates high-performance codes by exploring vast spaces of code variants. However, because the number of possible combinations is large and the constraints are complex, constructing these search spaces can be a major bottleneck. In real-world applications, search space construction can take minutes to hours, or even days. Current state-of-the-art techniques for search space construction, such as chain-of-trees, lack a formal foundation and perform adequately only on a specific subset of search spaces. We show that search space construction for constraint-based auto-tuning can be reformulated as a Constraint Satisfaction Problem (CSP). Building on this insight with a CSP solver, we develop a runtime parser that translates user-defined constraint functions into solver-optimal expressions, optimize the solver to exploit common structures in auto-tuning constraints, and integrate these and other advances into open-source tools. These contributions substantially improve performance and accessibility while preserving flexibility. We evaluate our approach on a diverse set of benchmarks, demonstrating that our optimized solver reduces construction time by four orders of magnitude versus brute-force enumeration, three orders of magnitude versus an unoptimized CSP solver, and one to two orders of magnitude versus leading auto-tuning frameworks built on chain-of-trees. We thus eliminate a critical scalability barrier for auto-tuning and provide a drop-in solution that enables the exploration of previously unattainable problem scales in auto-tuning and related domains.
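
To make the CSP reformulation concrete, the following minimal sketch contrasts brute-force enumeration with a plain backtracking search that checks each constraint as soon as all of its variables are assigned. It is an illustration only, not the optimized solver or the runtime parser described above: the parameter names (`block_size_x`, `block_size_y`, `tile_size`), their values, and the two constraints are hypothetical, and the helper functions `brute_force` and `backtracking` are invented for this example.

```python
from itertools import product

# Hypothetical tunable parameters (variables and their finite domains).
params = {
    "block_size_x": [16, 32, 64, 128, 256],
    "block_size_y": [1, 2, 4, 8, 16],
    "tile_size": [1, 2, 4],
}

# Hypothetical constraints: (scope, predicate) pairs, where the scope lists
# the parameters the predicate reads, in argument order.
constraints = [
    (("block_size_x", "block_size_y"), lambda x, y: 32 <= x * y <= 1024),
    (("block_size_y", "tile_size"), lambda y, t: y % t == 0),
]


def brute_force(params, constraints):
    """Enumerate the full Cartesian product, then filter by all constraints."""
    names = list(params)
    valid = []
    for values in product(*params.values()):
        config = dict(zip(names, values))
        if all(pred(*(config[n] for n in scope)) for scope, pred in constraints):
            valid.append(values)
    return valid


def backtracking(params, constraints):
    """Assign parameters one at a time and prune a partial assignment as soon
    as a constraint whose scope is fully assigned is violated."""
    names = list(params)
    valid = []

    def extend(config):
        if len(config) == len(names):
            valid.append(tuple(config[n] for n in names))
            return
        name = names[len(config)]
        for value in params[name]:
            config[name] = value
            # Only constraints that involve the newly assigned parameter and
            # are now fully assigned need to be checked at this step.
            consistent = all(
                pred(*(config[n] for n in scope))
                for scope, pred in constraints
                if name in scope and all(n in config for n in scope)
            )
            if consistent:
                extend(config)
            del config[name]

    extend({})
    return valid


if __name__ == "__main__":
    space = backtracking(params, constraints)
    print(f"{len(space)} valid configurations out of "
          f"{len(list(product(*params.values())))} combinations")
```

Both functions return the same set of valid configurations. The difference is that the backtracking variant discards an entire subtree once a partial assignment violates a constraint, which is the basic property a CSP solver builds on.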
