The bulk synchronous parallel (BSP) model struggles with irregular workloads because of its rigid global communication phases. Fine-grained asynchronous BSP (FA-BSP) improves the overlap of computation and communication, but existing implementations typically rely on a restrictive one-process-per-core model. This paper proposes a multithreaded FA-BSP approach that combines the Lightweight Communication Interface (LCI) with OpenMP to fully exploit multicore architectures. We evaluate the design using the NAS Parallel Benchmark Integer Sort (IS), retaining its original irregular Gaussian key distribution to rigorously test load balancing. By replacing synchronous MPI collectives with OpenMP multithreading and LCI's fine-grained, zero-copy active messages, we enable efficient computation-communication overlap. Our evaluation demonstrates that multithreaded FA-BSP significantly outperforms traditional bulk-synchronous MPI implementations, offering a scalable solution for irregular scientific applications.
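To make the load-balancing concern concrete, the following minimal sketch shows how a bell-shaped key distribution leaves ranks unevenly loaded when the key range is split into equal-width slices. It is an illustration only: the key generator (the mean of four uniform samples, drawn with Python's `random` module) is an assumed stand-in rather than the benchmark's own generator, and the function names, key count, and rank count are hypothetical.

```python
import random
from collections import Counter


def gaussian_like_keys(n, max_key, seed=0):
    """Draw n integer keys in [0, max_key) as the mean of four uniform samples.

    Assumption: this only approximates a bell-shaped key distribution;
    it is not the benchmark's actual random number generator.
    """
    rng = random.Random(seed)
    return [
        int(max_key * (rng.random() + rng.random() + rng.random() + rng.random()) / 4.0)
        for _ in range(n)
    ]


def keys_per_rank(keys, max_key, num_ranks):
    """Count keys per rank when the key range is split into equal-width slices."""
    width = max_key // num_ranks
    # Clamp to the last rank so any key at the top of the range stays in bounds.
    counts = Counter(min(k // width, num_ranks - 1) for k in keys)
    return [counts.get(r, 0) for r in range(num_ranks)]


def imbalance(counts):
    """Ratio of the busiest rank's load to the mean load (1.0 is perfectly balanced)."""
    return max(counts) / (sum(counts) / len(counts))


# Hypothetical sizes chosen for illustration: 100,000 keys, 16-bit key range, 8 ranks.
counts = keys_per_rank(gaussian_like_keys(100_000, 1 << 16), 1 << 16, 8)
print("keys per rank:", counts)
print("imbalance factor:", round(imbalance(counts), 2))
```

In this sketch the ranks that own the middle of the key range receive far more keys than those at the edges. Under a bulk-synchronous exchange every rank would wait for the most heavily loaded one, which is the situation that fine-grained asynchronous messaging is meant to mitigate.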