Latency-critical, high-performance applications are now parallelized even on power-constrained client systems to improve performance. However, fine-grained tasking on simultaneous multithreading (SMT) CPU cores in such systems, an important scenario, has not been well studied in prior work. In this paper, we therefore conduct a performance analysis of state-of-the-art shared-memory parallel programming frameworks on SMT cores using real-world fine-grained application kernels. We introduce Relic, a simple, specialized, software-only parallel programming framework that enables extremely fine-grained tasking on SMT cores. Using the Relic framework, we increase performance speedups over serial implementations of the benchmark kernels by 19.1% compared to LLVM OpenMP, by 31.0% compared to GNU OpenMP, by 20.2% compared to Intel OpenMP, by 33.2% compared to X-OpenMP, by 30.1% compared to oneTBB, by 23.0% compared to Taskflow, and by 21.4% compared to OpenCilk.