Parallel applications with irregular and time-varying workloads often suffer from load imbalance. Dynamic load balancing techniques address this challenge by redistributing work during execution. We present a new distributed, diffusion-based load balancing strategy targeted at communication-intensive applications with persistently communicating objects. Leveraging the application's communication graph, our strategy reduces inter-node communication while still distributing load effectively. We also propose an algorithmic variant for cases where the communication patterns are not readily available. We explore optimizations to our algorithm and compare it with related load balancing strategies, both in simulation and on a Particle-in-Cell benchmark on up to 8 nodes of Perlmutter at NERSC.
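For readers unfamiliar with diffusion-based balancing, the following is a minimal sketch of the generic first-order diffusion idea, not the strategy proposed here. It assumes a fixed, symmetric neighbor graph between nodes and treats load as a divisible quantity. It does not model object migration or the application's communication graph. The names `diffuse`, `neighbors`, `alpha`, and `rounds` are illustrative choices.

```python
def diffuse(loads, neighbors, alpha=0.25, rounds=50):
    """Generic first-order diffusion: each node repeatedly exchanges a
    fraction `alpha` of its load difference with every neighbor.

    Assumptions (illustrative only): `neighbors` is symmetric, and
    `alpha` is small enough for stability, e.g. at most
    1 / (max_degree + 1).
    """
    loads = list(loads)
    for _ in range(rounds):
        updated = loads[:]
        for i, nbrs in neighbors.items():
            for j in nbrs:
                # Load flows from the heavier node to the lighter one.
                updated[i] += alpha * (loads[j] - loads[i])
        loads = updated
    return loads


# Hypothetical example: a ring of 4 nodes with all load on node 0.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
balanced = diffuse([100.0, 0.0, 0.0, 0.0], neighbors)
```

Because each exchange is purely local, total load is conserved and no global coordinator is needed, which is what makes diffusion attractive as a distributed scheme. A communication-aware strategy of the kind described above would additionally have to decide which objects to move so that heavily communicating objects stay on the same node.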