Portable, heterogeneous ensemble workflows at scale using libEnsemble

libEnsemble is a Python-based toolkit for running dynamic ensembles, developed as part of the DOE Exascale Computing Project. The toolkit uses a unique generator-simulator-allocator paradigm, in which generators produce inputs for simulators, simulators evaluate those inputs, and allocators decide whether and when a simulator or generator should be called. The generator steers the ensemble based on simulation results. libEnsemble coordinates the ensemble through communication between a manager and workers. Flexibility comes from multiple manager-worker communication substrates, each with different benefits: Python's multiprocessing, mpi4py, and TCP. Multisite ensembles are supported using Balsam or Globus Compute. We give an overview of the unique characteristics of libEnsemble as well as its current and potential interoperability with other packages in the workflow ecosystem. We highlight libEnsemble's dynamic resource features: libEnsemble can detect system resources (nodes, cores, and GPUs) and assign them in a portable way. These features allow users to specify the resources required for each simulation automatically on a range of systems, including Frontier, Aurora, and Perlmutter. Such ensembles can include multiple simulation types, some using GPUs and others using only CPUs, sharing nodes for maximum efficiency. We demonstrate libEnsemble's capabilities, scalability, and scientific impact via a Gaussian process surrogate training problem for the longitudinal density profile at the exit of a plasma accelerator stage, using Wake-T and WarpX simulations. We also describe the benefits of libEnsemble's generator-simulator coupling, which gives the user a simple way to cancel, and portably kill, running simulations. Such control can be directed from the generator or allocator based on models that are updated with intermediate simulation output.
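
The generator-simulator-allocator paradigm described above can be illustrated with a minimal, serial sketch in plain Python. This is not libEnsemble's API: the function names (`generator`, `simulator`, `allocator`, `run_ensemble`), the toy quadratic objective, and the batch size are all assumptions made for illustration, and the real toolkit distributes this work across a manager and workers rather than running it in a single loop.

```python
import random
from collections import deque


def generator(history, rng, batch_size=4):
    """Propose new inputs (hypothetical generator).

    With no results yet, sample uniformly; otherwise sample near the best
    point seen so far, so simulation results steer the ensemble.
    """
    if not history:
        return [rng.uniform(-2.0, 2.0) for _ in range(batch_size)]
    best_x, _ = min(history, key=lambda rec: rec[1])
    return [best_x + rng.gauss(0.0, 0.3) for _ in range(batch_size)]


def simulator(x):
    """Evaluate one input (toy stand-in for an expensive simulation)."""
    return (x - 1.0) ** 2


def allocator(pending, n_evaluated, max_sims):
    """Decide what to call next: a simulator, a generator, or nothing."""
    if n_evaluated >= max_sims:
        return "stop"
    if pending:
        return "sim"
    return "gen"


def run_ensemble(max_sims=40, seed=0):
    """Serial stand-in for the manager loop (assumed structure)."""
    rng = random.Random(seed)
    history = []       # list of (input, output) records
    pending = deque()  # generated inputs awaiting evaluation
    while True:
        action = allocator(pending, len(history), max_sims)
        if action == "stop":
            break
        if action == "gen":
            pending.extend(generator(history, rng))
        else:
            x = pending.popleft()
            history.append((x, simulator(x)))
    return history


if __name__ == "__main__":
    results = run_ensemble()
    best_x, best_f = min(results, key=lambda rec: rec[1])
    print(f"evaluated {len(results)} points; best x = {best_x:.3f}, f = {best_f:.5f}")
```

Keeping the allocation decision in its own function means the policy for when to generate, when to simulate, and when to stop can be changed without touching the generator or the simulator, which is the separation of concerns the paradigm relies on.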
