Radiation Hydrodynamics at Scale: Comparing MPI and Asynchronous Many-Task Runtimes with FleCSI

Writing efficient distributed code remains labor-intensive and complex. To simplify application development, the Flexible Computational Science Infrastructure (FleCSI) framework offers a user-oriented, high-level programming interface built on a task-based runtime model. Internally, FleCSI integrates state-of-the-art parallelization backends, including MPI and the asynchronous many-task runtimes (AMTRs) Legion and HPX, enabling applications to fully leverage asynchronous parallelism. In this work, we benchmark two applications using FleCSI's three backends on up to 1024 nodes to quantify the advantages and overheads introduced by the AMTR backends. As representative applications, we select a simple Poisson solver and the multidimensional radiation hydrodynamics code HARD. In the communication-focused Poisson solver benchmark, FleCSI achieves over 97% parallel efficiency with the MPI backend under weak scaling on up to 131072 cores, indicating that its abstraction layer introduces only minimal overhead. While the Legion backend exhibits notable overheads and scaling limitations, the HPX backend introduces only marginal overhead compared to MPI+Kokkos. However, the scalability of the HPX backend is currently limited by its use of non-optimized HPX collective operations. In the computation-focused radiation hydrodynamics benchmarks, the performance gap between the MPI and HPX backends fades. On fewer than 64 nodes, the HPX backend outperforms MPI+Kokkos, achieving an average speedup of 1.31 under weak scaling and up to 1.27 under strong scaling. For the hydrodynamics-only HARD benchmark, the HPX backend performs best on fewer than 32 nodes, achieving speedups of up to 1.20 relative to MPI and up to 1.64 relative to MPI+Kokkos.
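The abstract reports parallel efficiency and speedup without stating how they are computed. As a minimal sketch, assuming the conventional definitions (which may differ in detail from those used in the paper), the metrics could be computed as follows. The function names and all timings are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Weak scaling keeps the work per node fixed, so the ideal runtime
    is constant. Efficiency is the baseline time over the scaled time."""
    if t_base <= 0 or t_scaled <= 0:
        raise ValueError("runtimes must be positive")
    return t_base / t_scaled


def strong_scaling_efficiency(t_base: float, n_base: int,
                              t_scaled: float, n_scaled: int) -> float:
    """Strong scaling keeps the total work fixed, so the ideal runtime
    shrinks in proportion to the node count."""
    if min(t_base, t_scaled) <= 0 or min(n_base, n_scaled) <= 0:
        raise ValueError("runtimes and node counts must be positive")
    return (t_base * n_base) / (t_scaled * n_scaled)


def speedup(t_reference: float, t_candidate: float) -> float:
    """Speedup of a candidate backend relative to a reference backend
    on the same problem and node count (assumed definition)."""
    if t_reference <= 0 or t_candidate <= 0:
        raise ValueError("runtimes must be positive")
    return t_reference / t_candidate


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    print(f"weak efficiency:   {weak_scaling_efficiency(10.0, 10.3):.3f}")
    print(f"strong efficiency: {strong_scaling_efficiency(100.0, 1, 14.0, 8):.3f}")
    print(f"speedup:           {speedup(13.1, 10.0):.2f}")
```

Under these definitions, a weak-scaling efficiency above 0.97 means that the runtime at the largest scale exceeds the baseline runtime by roughly 3% or less, and a speedup of 1.31 means that the reference backend takes 1.31 times as long as the candidate.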
