Uncertainty Quantification (UQ) workloads are becoming increasingly common in science and engineering. They involve submitting thousands or even millions of similar tasks with potentially unpredictable runtimes, and the total number of tasks is usually not known a priori. A static, one-size-fits-all batch script is likely to lead to suboptimal scheduling, and native schedulers such as SLURM, as installed on High-Performance Computing (HPC) systems, often struggle to handle such workloads efficiently. In this paper, we introduce a new load balancing approach suitable for UQ workflows. To demonstrate its efficiency in a real-world setting, we focus on the GS2 gyrokinetic plasma turbulence simulator. Individual simulations can be computationally demanding, with runtimes varying significantly (from minutes to hours) depending on the high-dimensional input parameters. Our approach uses the UQ and Modelling Bridge (UM-Bridge), which offers a language-agnostic interface to a simulation model, combined with HyperQueue, which works alongside the native scheduler. In particular, deploying this framework on HPC systems does not require system-level changes. We benchmark our proposed framework against a standalone SLURM approach using GS2 and a Gaussian Process surrogate thereof. Our results demonstrate a reduction in scheduling overhead of up to three orders of magnitude and a maximum reduction of 38% in CPU time for long-running simulations compared to the naive SLURM approach, while making no assumptions about the job submission patterns inherent to UQ workflows.