Fully strict fork-join parallelism is a powerful model for shared-memory programming due to its optimal time scaling and strong bounds on memory scaling. The latter is rarely achieved due to the difficulty of implementing continuation stealing in traditional High Performance Computing (HPC) languages, where it is often impossible without modifying the compiler or resorting to non-portable techniques. We demonstrate how stackless coroutines (a new feature in C++20) can enable fully portable continuation stealing and present libfork, a lock-free fine-grained parallelism library that combines coroutines with user-space, geometric segmented stacks. We show that our approach achieves optimal time/memory scaling, both theoretically and empirically, across a variety of benchmarks. Compared to OpenMP (libomp), libfork is on average 7.2x faster and consumes 10x less memory. Similarly, compared to Intel's TBB, libfork is on average 2.7x faster and consumes 6.2x less memory. Additionally, we introduce non-uniform memory access (NUMA) optimizations for schedulers that demonstrate performance matching busy-waiting schedulers.