Achieving efficient task parallelism on many-core architectures is an important challenge. GNU OpenMP, a widely used implementation of the popular OpenMP parallel programming model, incurs high overhead for fine-grained, short-running tasks because of the time spent on runtime synchronization. In this work, we introduce and analyze three key advances that collectively achieve significant performance gains. First, we introduce XQueue, a lock-less concurrent queue implementation that replaces GNU's priority task queue and removes the global task lock. Second, we develop a scalable, efficient, hybrid lock-free/lock-less distributed tree barrier to address the high hardware synchronization overhead of GNU's centralized barrier. Third, we develop two lock-less, NUMA-aware load balancing strategies. We evaluate our implementation using the Barcelona OpenMP Task Suite (BOTS) benchmarks. We show that XQueue combined with the distributed tree barrier can improve performance by up to 1522.8$\times$ compared to the original GNU OpenMP. We further show that lock-less load balancing can improve performance by up to 4$\times$ compared to GNU OpenMP using XQueue.
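To give a rough sense of what a queue without locks can look like, the following is a minimal C++17 sketch of a single-producer/single-consumer ring buffer. It is an illustration only: the abstract does not describe XQueue's internal design, so the class name `SpscQueue`, its interface, and the choice of a bounded ring buffer with acquire/release atomics are all assumptions made for this example, not the authors' implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical single-producer/single-consumer queue (not XQueue itself).
// Exactly one thread may call push() and exactly one thread may call pop().
// No mutex is taken: the two sides coordinate only through the head and
// tail indices, published with release stores and read with acquire loads.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "one slot is always kept empty");

public:
    // Producer side. Returns false when the queue is full.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full: advancing tail would collide with head
        }
        slots_[tail] = value;
        tail_.store(next, std::memory_order_release);  // publish the item
        return true;
    }

    // Consumer side. Returns std::nullopt when the queue is empty.
    std::optional<T> pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return std::nullopt;  // empty
        }
        T value = slots_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return value;
    }

private:
    T slots_[Capacity]{};
    std::atomic<std::size_t> head_{0};  // next slot to read
    std::atomic<std::size_t> tail_{0};  // next slot to write
};

// Single-threaded usage check: push three task ids, then drain them in
// FIFO order and return their sum.
inline int demo_sum() {
    SpscQueue<int, 4> queue;  // 4 slots, 3 usable
    for (int task = 1; task <= 3; ++task) {
        const bool ok = queue.push(task);
        assert(ok);
        (void)ok;
    }
    int sum = 0;
    while (auto task = queue.pop()) {
        sum += *task;
    }
    return sum;
}
```

Because each index has exactly one writer, neither side ever needs a compare-and-swap loop or a lock. A multi-worker runtime would need some arrangement of such queues between workers; how XQueue does this is not stated in the abstract and is not modeled here.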