Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks due to time spent on runtime synchronization. In this work, we introduce and analyze three key advances that collectively achieve significant performance gains. First, we introduce XQueue, a lock-less concurrent queue implementation to replace GNU's priority task queue and remove the global task lock. Second, we develop a scalable, efficient, and hybrid lock-free/lock-less distributed tree barrier to address the high hardware synchronization overhead from GNU's centralized barrier. Third, we develop two lock-less and NUMA-aware load balancing strategies. We evaluate our implementation using the Barcelona OpenMP Task Suite (BOTS) benchmarks. Results from the first and second advances demonstrate up to 1522.8$\times$ performance improvement compared to the original GNU OpenMP. Further improvements from lock-less load balancing show up to 4$\times$ improvement compared to GNU OpenMP using XQueue. Through a rich set of profiling and instrumentation tools, we are able to investigate the runtime behavior of GNU OpenMP and improve its performance on fine-grained tasks by many orders of magnitude.
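As a rough illustration of how a concurrent queue can avoid locks, the sketch below shows a bounded single-producer/single-consumer ring buffer in C++ that synchronizes only through atomic loads and stores on two indices. It is not XQueue: the abstract does not describe XQueue's internal design, and the names `SpscQueue`, `try_push`, and `try_pop` are hypothetical, chosen for this example only.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Hypothetical bounded single-producer/single-consumer queue (not XQueue).
// Exactly one thread may call try_push and exactly one may call try_pop.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "one slot is kept empty to tell full from empty");

public:
    // Producer side: returns false when the queue is full.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        slots_[tail] = value;
        // Release publishes the slot write before the new tail becomes visible.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side: returns std::nullopt when the queue is empty.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return std::nullopt;  // empty
        }
        T value = slots_[head];
        // Release hands the slot back to the producer only after it was read.
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return value;
    }

private:
    T slots_[Capacity]{};
    std::atomic<std::size_t> head_{0};  // next slot to pop; written by the consumer only
    std::atomic<std::size_t> tail_{0};  // next slot to fill; written by the producer only
};
```

Because each index has a single writer, plain atomic loads and stores with acquire/release ordering are enough here; a queue shared by several producers or consumers would typically need compare-and-swap operations or a lock instead.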