Since the introduction of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Because several implementations of the parallel algorithms exist, a systematic, quantitative performance comparison is essential for choosing the appropriate implementation for a particular hardware configuration. In this work, we introduce a specialized set of micro-benchmarks to assess the scalability of the parallel algorithms in the STL. By selecting different backends, our micro-benchmarks can be used on multi-core systems and GPUs. In a case study on AMD and Intel CPUs and NVIDIA GPUs, the suite allowed us to identify substantial performance disparities among different implementations, including GCC+TBB, GCC+HPX, Intel's compiler with TBB, and NVIDIA's compiler with OpenMP and CUDA.