Cost-Effective Methodology for Complex Tuning Searches in HPC: Navigating Interdependencies and Dimensionality

Tuning searches are pivotal in High-Performance Computing (HPC), where they address complex optimization challenges in computational applications. The complexity arises not only from the fine-grained tuning of parameters within routines but also from potential interdependencies among them, which render traditional optimization methods inefficient. Rather than analyzing interdependencies among parameters and routines, practitioners often face a dilemma: conduct an independent tuning search for each routine, thereby overlooking interdependence, or pursue a more resource-intensive joint search over all routines. This dilemma arises because some interdependence analysis and high-dimensional decomposition techniques in the literature may be prohibitively expensive in HPC tuning searches. Our methodology adapts and refines these methods to ensure computational feasibility while maximizing performance gains in real-world scenarios. It leverages a cost-effective interdependence analysis to decide whether to merge several tuning searches into a joint search or to conduct orthogonal searches. Tested on synthetic functions with varying levels of parameter interdependence, our methodology efficiently explores the search space. Compared with Bayesian-optimization-based fully independent or fully joint searches, it suggested an optimized breakdown of independent and merged searches that led to final configurations up to 8% more accurate while reducing the search time by up to 95%. When applied to GPU-offloaded Real-Time Time-Dependent Density Functional Theory (RT-TDDFT), an application in computational materials science that challenges modern HPC autotuners, our methodology achieved an effective tuning search. Its adaptability and efficiency extend beyond RT-TDDFT, making it valuable for related applications in HPC.
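
The merge-or-split decision described above can be illustrated with a minimal sketch. It is not the analysis used in this work: it swaps in a simple pairwise finite-difference check, in the spirit of differential grouping, and the names `interacts`, `group_parameters`, and `synthetic` are hypothetical. Parameters whose joint perturbation is non-additive are placed in the same group (one joint search); all others are left as independent searches.

```python
from itertools import combinations


def interacts(f, base, i, j, delta=1.0, tol=1e-9):
    """Return True if perturbing parameters i and j together is non-additive."""
    def shifted(*idx):
        x = list(base)
        for k in idx:
            x[k] += delta
        return f(x)

    # Second-order mixed difference: zero when f is additively separable in i and j.
    mixed = shifted(i, j) - shifted(i) - shifted(j) + shifted()
    return abs(mixed) > tol


def group_parameters(f, base, delta=1.0, tol=1e-9):
    """Partition parameter indices into groups that should be searched jointly."""
    n = len(base)
    parent = list(range(n))  # union-find forest over parameter indices

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for i, j in combinations(range(n), 2):
        if interacts(f, base, i, j, delta, tol):
            parent[find(i)] = find(j)  # merge the two searches

    groups = {}
    for k in range(n):
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())


def synthetic(x):
    # Hypothetical cost function: x[0] and x[1] are coupled, x[2] is independent.
    return (x[0] * x[1] - 2.0) ** 2 + (x[2] - 1.0) ** 2


groups = group_parameters(synthetic, base=[0.0, 0.0, 0.0])
print(groups)  # [[0, 1], [2]]
```

Here parameters 0 and 1 would be tuned in one joint search and parameter 2 on its own. The check costs four function evaluations per pair, but a single base point can miss an interaction, and measured runtimes are noisy, so a practical version would presumably sample several base points and use a noise-aware tolerance.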
