HPC systems expose many configuration parameters that jointly drive competing objectives. Existing tools such as autotuners recommend good configurations, but they do not identify the minimal changes a near-miss configuration needs to meet a performance objective, and they often ignore domain-specific constraints. To address this gap, we introduce COMPASS, a modular, programmable engine that uses operational traces to generate HPC configuration recommendations and guide tuning decisions. This paper: (1) formalizes configuration questions into query patterns; (2) develops an interactive decision-making engine that formulates these queries as Machine Learning (ML) tasks; and (3) establishes the trustworthiness of its recommendations by providing evidence and quantifying uncertainty and, when confidence is low, provides guidance on which configurations to run next. We validate COMPASS using analytical ground truth, reconstruction accuracy, reproduction of published findings, and, when possible, runs on real hardware. When integrated with an open-source HPC scheduling simulator, COMPASS cuts average job turnaround time by 65.93% and node usage by 80.93% relative to the state of the art. Moreover, COMPASS achieves up to 100x faster training and 80x faster inference than state-of-the-art generative methods, and scales to traces with 1.3B samples and 126 GB of data.