Portable, heterogeneous ensemble workflows at scale using libEnsemble

libEnsemble is a Python-based toolkit for running dynamic ensembles, developed as part of the DOE Exascale Computing Project. The toolkit uses a unique generator--simulator--allocator paradigm, in which generators produce inputs for simulators, simulators evaluate those inputs, and allocators decide whether and when a simulator or generator should be called. The generator steers the ensemble based on simulation results. Communication in libEnsemble takes place between a manager and workers. Flexibility is provided through multiple manager--worker communication substrates, each of which has different benefits: Python's multiprocessing, mpi4py, and TCP. Multisite ensembles are supported using Balsam or Globus Compute. We give an overview of the unique characteristics of libEnsemble, as well as its current and potential interoperability with other packages in the workflow ecosystem. We highlight libEnsemble's dynamic resource features: libEnsemble can detect system resources (nodes, cores, and GPUs) and assign them in a portable way. These features allow users to specify the resources required for each simulation automatically on a range of systems, including Frontier, Aurora, and Perlmutter. Such ensembles can include multiple simulation types, some using GPUs and others using only CPUs, that share nodes for maximum efficiency. We demonstrate libEnsemble's capabilities, scalability, and scientific impact via a Gaussian process surrogate training problem for the longitudinal density profile at the exit of a plasma accelerator stage, using Wake-T and WarpX simulations. We also describe the benefits of libEnsemble's generator--simulator coupling, which gives the user a straightforward way to cancel, and portably kill, running simulations. Such control can be directed from the generator or allocator based on models that are updated with intermediate simulation output.
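
The generator--simulator--allocator division of labor described above can be illustrated with a minimal, self-contained sketch in plain Python. This is a toy and does not use or reproduce libEnsemble's API: the function names (generator, simulator, allocator, run_ensemble), the quadratic stand-in for a simulation, and the serial loop in place of a manager and workers are all assumptions made for illustration only.

```python
import random
from collections import deque


def generator(history, batch_size, rng):
    """Propose new inputs, steering toward the best result seen so far."""
    if not history:
        # No results yet: sample the (assumed) search interval uniformly.
        return [rng.uniform(-2.0, 2.0) for _ in range(batch_size)]
    best_x, _ = min(history, key=lambda rec: rec[1])
    # Perturb the current best input to propose the next batch.
    return [best_x + rng.gauss(0.0, 0.3) for _ in range(batch_size)]


def simulator(x):
    """Toy stand-in for an expensive simulation: minimum at x = 1."""
    return (x - 1.0) ** 2


def allocator(pending, history, max_sims):
    """Decide what to call next: 'sim', 'gen', or 'stop'."""
    if len(history) >= max_sims:
        return "stop"
    return "sim" if pending else "gen"


def run_ensemble(max_sims=40, batch_size=4, seed=0):
    """Serial loop playing the role of the manager in this toy."""
    rng = random.Random(seed)
    pending = deque()   # inputs proposed but not yet evaluated
    history = []        # (input, output) pairs from finished simulations
    while True:
        action = allocator(pending, history, max_sims)
        if action == "stop":
            break
        if action == "gen":
            pending.extend(generator(history, batch_size, rng))
        else:
            x = pending.popleft()
            history.append((x, simulator(x)))
    return history


history = run_ensemble()
best_x, best_f = min(history, key=lambda rec: rec[1])
print(f"evaluated {len(history)} points; best x = {best_x:.3f}, f = {best_f:.4f}")
```

In this sketch the allocator is the only place that decides which function runs next, so changing the scheduling policy (for example, requesting new inputs before the queue is empty) would not require touching the generator or the simulator.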
