From Fork-Join to Asynchronous Tasks: Parallelizing Tiled Cholesky Decomposition with OpenMP and HPX

Fork-join parallelism, popularized by OpenMP, remains the dominant model for shared-memory parallel programming, but its implicit synchronization barriers can penalize algorithms with inhomogeneous workloads. Asynchronous many-task (AMT) runtimes sidestep these barriers by expressing work as a dependency graph of fine-grained tasks. Yet, the actual performance benefit over a carefully written fork-join baseline is rarely quantified. In this work, we introduce Cholesky-Bench and use it to revisit the tiled Cholesky decomposition, a canonical irregular kernel, comparing four parallelization variants of the right-looking algorithm across two runtimes: the OpenMP implementations shipped with GCC and LLVM, and the HPX AMT runtime. The variants span classical fork-join, a collapsed fork-join that exposes additional inner-loop parallelism, synchronous tasking, and asynchronous tasking with explicit data dependencies. We benchmark all eight combinations on a dual-socket 128-core AMD Zen 2 node across multiple tile sizes and problem sizes. Our results show that across all variants, HPX outperforms OpenMP at the optimal tile size by 15%-30%. Specifically, asynchronous HPX tasks are up to 26% faster than their OpenMP counterparts, and exhibit roughly 3.8x smaller task overhead. Furthermore, the collapsed fork-join variants close most of the gap to synchronous tasking. Removing redundant synchronization barriers yields an additional improvement of 7% (OpenMP) to 14% (HPX). A GCC-versus-LLVM comparison further reveals compiler-specific differences in fork-join scheduling and task-creation overheads.
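
To make the "asynchronous tasking with explicit data dependencies" variant concrete, the following is a minimal sketch of a right-looking tiled Cholesky factorization written with OpenMP task dependencies. It is not the Cholesky-Bench code. The names TiledMatrix, potrf, trsm, gemm_nt, tiled_cholesky_tasks and factorization_error are hypothetical, the tile-major layout is an assumption, and the naive loop kernels stand in for the tuned BLAS/LAPACK routines a real benchmark would presumably call.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using std::size_t;

// Assumed layout: T x T tiles, each stored contiguously as an nb x nb row-major block.
struct TiledMatrix {
    size_t T, nb;
    std::vector<double> data;
    TiledMatrix(size_t tiles, size_t tile_size)
        : T(tiles), nb(tile_size), data(tiles * tiles * tile_size * tile_size, 0.0) {}
    double* tile(size_t i, size_t j) { return data.data() + (i * T + j) * nb * nb; }
    double& at(size_t r, size_t c) { return tile(r / nb, c / nb)[(r % nb) * nb + c % nb]; }
};

// In-place unblocked Cholesky of one tile; only the lower triangle is written.
void potrf(double* a, size_t nb) {
    for (size_t j = 0; j < nb; ++j) {
        for (size_t k = 0; k < j; ++k) a[j * nb + j] -= a[j * nb + k] * a[j * nb + k];
        assert(a[j * nb + j] > 0.0);  // fails if the input is not positive definite
        a[j * nb + j] = std::sqrt(a[j * nb + j]);
        for (size_t i = j + 1; i < nb; ++i) {
            for (size_t k = 0; k < j; ++k) a[i * nb + j] -= a[i * nb + k] * a[j * nb + k];
            a[i * nb + j] /= a[j * nb + j];
        }
    }
}

// Triangular solve: B <- B * L^{-T}, with L lower triangular.
void trsm(const double* l, double* b, size_t nb) {
    for (size_t r = 0; r < nb; ++r)
        for (size_t c = 0; c < nb; ++c) {
            for (size_t k = 0; k < c; ++k) b[r * nb + c] -= b[r * nb + k] * l[c * nb + k];
            b[r * nb + c] /= l[c * nb + c];
        }
}

// Trailing update: C <- C - A * B^T (acts as the symmetric update when A == B).
void gemm_nt(const double* a, const double* b, double* c, size_t nb) {
    for (size_t r = 0; r < nb; ++r)
        for (size_t s = 0; s < nb; ++s)
            for (size_t k = 0; k < nb; ++k) c[r * nb + s] -= a[r * nb + k] * b[s * nb + k];
}

// One thread creates all tasks; the runtime orders them through the depend
// clauses, so no barrier separates the steps of the outer k loop.
void tiled_cholesky_tasks(TiledMatrix& A) {
    const size_t T = A.T, nb = A.nb;
    #pragma omp parallel
    #pragma omp single
    for (size_t k = 0; k < T; ++k) {
        double* kk = A.tile(k, k);
        #pragma omp task depend(inout: kk[0])
        potrf(kk, nb);
        for (size_t i = k + 1; i < T; ++i) {
            double* ik = A.tile(i, k);
            #pragma omp task depend(in: kk[0]) depend(inout: ik[0])
            trsm(kk, ik, nb);
        }
        for (size_t i = k + 1; i < T; ++i) {
            double* ik = A.tile(i, k);
            for (size_t j = k + 1; j < i; ++j) {
                double* jk = A.tile(j, k);
                double* ij = A.tile(i, j);
                #pragma omp task depend(in: ik[0], jk[0]) depend(inout: ij[0])
                gemm_nt(ik, jk, ij, nb);
            }
            double* ii = A.tile(i, i);
            #pragma omp task depend(in: ik[0]) depend(inout: ii[0])
            gemm_nt(ik, ik, ii, nb);
        }
    }
}

// Factorizes a diagonally dominant test matrix and returns max |A - L L^T|
// over the lower triangle.
double factorization_error(size_t T, size_t nb) {
    const size_t n = T * nb;
    TiledMatrix A(T, nb), L(T, nb);
    for (size_t r = 0; r < n; ++r)
        for (size_t c = 0; c < n; ++c)
            A.at(r, c) = L.at(r, c) = (r == c) ? n + 1.0 : 1.0 / (1.0 + (r > c ? r - c : c - r));
    tiled_cholesky_tasks(L);
    double err = 0.0;
    for (size_t r = 0; r < n; ++r)
        for (size_t c = 0; c <= r; ++c) {
            double s = 0.0;
            for (size_t k = 0; k <= c; ++k) s += L.at(r, k) * L.at(c, k);
            err = std::fmax(err, std::fabs(s - A.at(r, c)));
        }
    return err;
}
```

Built with an OpenMP flag such as -fopenmp, the tasks can run concurrently as soon as their input tiles are ready; without the flag the pragmas are ignored and the same code runs serially. A fork-join variant would instead place a parallel loop over the trsm calls and another over the trailing updates, each ending in an implicit barrier, which is the synchronization the abstract identifies as costly for irregular workloads.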

