Modern HPC systems increasingly rely on higher core counts and wider vector registers, so applications must be adapted to fully utilize these hardware capabilities. Molecular dynamics simulations are one class of applications that can benefit from this increase in parallelism. In this paper, we describe our efforts to modernize the ESPResSo++ molecular dynamics simulation package. We restructure its particle data layout for efficient memory access and apply vectorization techniques to the calculation of short-range non-bonded forces. Together, these changes yield an overall threefold speedup and serve as a baseline for further optimizations. We also implement fine-grained parallelism for multi-core CPUs through HPX, a C++ runtime system that uses lightweight threads and an asynchronous many-task approach to maximize concurrency. Our goal is to evaluate the performance of an HPX-based approach against the bulk-synchronous MPI-based implementation. This requires adding a layer to the domain decomposition scheme that defines the task granularity. On spatially inhomogeneous systems, which cause load imbalance in traditional MPI-based approaches, we demonstrate that, with an optimal task size, the efficient work-stealing mechanisms of HPX can overcome the communication overhead, resulting in an overall 1.4-fold speedup over the baseline MPI version.
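To illustrate the kind of data-layout change the abstract refers to, the following is a minimal sketch of a structure-of-arrays particle container with a branch-free inner force loop that a compiler may auto-vectorize. It is not the ESPResSo++ implementation: the names `ParticleSoA` and `lj_forces` are hypothetical, the Lennard-Jones potential is assumed as a representative short-range non-bonded interaction, and the all-pairs loop omits neighbor lists and periodic boundaries for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays container: each coordinate lives in its
// own contiguous array, so the inner loop reads memory with unit stride.
struct ParticleSoA {
    std::vector<double> x, y, z;     // positions
    std::vector<double> fx, fy, fz;  // accumulated forces

    explicit ParticleSoA(std::size_t n)
        : x(n), y(n), z(n), fx(n, 0.0), fy(n, 0.0), fz(n, 0.0) {}

    std::size_t size() const { return x.size(); }
};

// Accumulates Lennard-Jones forces for every particle using a simple
// all-pairs loop (assumption: no neighbor list, no periodic boundaries).
void lj_forces(ParticleSoA& p, double epsilon, double sigma, double cutoff) {
    assert(cutoff > 0.0 && sigma > 0.0);
    const std::size_t n = p.size();
    const double rc2 = cutoff * cutoff;
    const double s2 = sigma * sigma;

    for (std::size_t i = 0; i < n; ++i) {
        const double xi = p.x[i], yi = p.y[i], zi = p.z[i];
        double fxi = 0.0, fyi = 0.0, fzi = 0.0;

        // The cutoff and self-interaction tests are expressed as a mask
        // rather than an early exit, which keeps the loop body uniform.
        for (std::size_t j = 0; j < n; ++j) {
            const double dx = xi - p.x[j];
            const double dy = yi - p.y[j];
            const double dz = zi - p.z[j];
            const double r2 = dx * dx + dy * dy + dz * dz;

            const bool inside = (j != i) && (r2 < rc2);
            const double r2safe = inside ? r2 : 1.0;  // avoid division by zero
            const double sr2 = s2 / r2safe;
            const double sr6 = sr2 * sr2 * sr2;
            // F = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2 * (r_i - r_j)
            const double fscale =
                inside ? 24.0 * epsilon * sr6 * (2.0 * sr6 - 1.0) / r2safe : 0.0;

            fxi += fscale * dx;
            fyi += fscale * dy;
            fzi += fscale * dz;
        }

        p.fx[i] += fxi;
        p.fy[i] += fyi;
        p.fz[i] += fzi;
    }
}
```

In an array-of-structures layout, the same loop would read positions interleaved with other per-particle fields, which typically forces strided or gathered loads. Whether the compiler actually vectorizes this sketch depends on the compiler and flags used.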