Portable, heterogeneous ensemble workflows at scale using libEnsemble

libEnsemble is a Python-based toolkit for running dynamic ensembles, developed as part of the DOE Exascale Computing Project. The toolkit uses a unique generator--simulator--allocator paradigm, in which generators produce inputs for simulators, simulators evaluate those inputs, and allocators decide whether and when a simulator or generator should be called. The generator steers the ensemble based on simulation results; generators may, for example, apply methods for numerical optimization, machine learning, or statistical calibration. Communication in libEnsemble takes place between a manager and workers. We give an overview of the unique characteristics of libEnsemble as well as its current and potential interoperability with other packages in the workflow ecosystem. We highlight libEnsemble's dynamic resource features: libEnsemble can detect system resources, such as available nodes, cores, and GPUs, and assign them in a portable way. These features allow users to specify the number of processors and GPUs required for each simulation, and resources are then assigned automatically on a wide range of systems, including Frontier, Aurora, and Perlmutter. Such ensembles can include multiple simulation types, some using GPUs and others using only CPUs, sharing nodes for maximum efficiency. We also describe the benefits of libEnsemble's generator--simulator coupling, which gives users a simple way to cancel, and portably kill, running simulations based on models that are updated with intermediate simulation output. We demonstrate libEnsemble's capabilities, scalability, and scientific impact via a Gaussian process surrogate training problem for the longitudinal density profile at the exit of a plasma accelerator stage. The study uses gpCAM for the surrogate model and employs either Wake-T or WarpX simulations, highlighting efficient use of resources that can easily extend to exascale.
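The generator--simulator--allocator paradigm described above can be illustrated with a small, self-contained sketch. This is a toy, serial model of the control flow only, and it does not use libEnsemble's API: the function names (`generator`, `simulator`, `allocator`, `run_ensemble`), the quadratic test function, and the "perturb the best point so far" steering rule are all assumptions made for illustration.

```python
import random


def generator(history, batch_size, rng):
    """Propose new inputs, steering toward the best result seen so far."""
    if not history:
        # No results yet: sample the (hypothetical) input range uniformly.
        return [rng.uniform(-2.0, 2.0) for _ in range(batch_size)]
    best_x, _ = min(history, key=lambda pair: pair[1])
    # Perturb the current best input; a stand-in for a real optimizer or model.
    return [best_x + rng.gauss(0.0, 0.3) for _ in range(batch_size)]


def simulator(x):
    """Stand-in for an expensive simulation: a quadratic with its minimum at x = 1."""
    return (x - 1.0) ** 2


def allocator(pending, history, max_sims):
    """Decide what to call next: 'sim', 'gen', or 'stop'."""
    if len(history) >= max_sims:
        return "stop"
    # Evaluate queued inputs first; ask the generator only when the queue is empty.
    return "sim" if pending else "gen"


def run_ensemble(max_sims=40, batch_size=4, seed=0):
    """Run the toy loop serially and return a list of (input, output) pairs."""
    rng = random.Random(seed)
    history = []  # completed simulations
    pending = []  # inputs produced by the generator but not yet simulated
    while True:
        action = allocator(pending, history, max_sims)
        if action == "stop":
            return history
        if action == "gen":
            pending.extend(generator(history, batch_size, rng))
        else:
            x = pending.pop(0)
            history.append((x, simulator(x)))


if __name__ == "__main__":
    results = run_ensemble()
    best_x, best_f = min(results, key=lambda pair: pair[1])
    print(f"{len(results)} simulations, best x = {best_x:.3f}, f = {best_f:.4f}")
```

In this sketch the three roles are deliberately kept as separate functions, so the decision of what to run next lives in the allocator rather than in the generator or simulator. A real ensemble would run simulations concurrently on workers rather than in a serial loop.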
