Graphics Processing Units (GPUs) excel at regular data-parallel workloads where massive hardware parallelism can be readily exploited. In contrast, many important irregular applications are naturally expressed as task parallelism with a fork-join control structure. While CPU runtimes for fork-join task parallelism are mature, supporting it efficiently on GPUs remains challenging. We propose GTaP, a GPU-resident runtime that supports fork-join task parallelism. GTaP is based on the persistent kernel model and supports two worker granularities: thread blocks and individual threads. To realize fork-join on GPUs, GTaP represents joins as continuations and executes each task as a state machine that can be split into multiple execution segments. We also extend Clang's frontend with a pragma-based programming model that lets programmers express fork-join parallelism while hiding these low-level mechanisms. GTaP employs work stealing for load balancing, which scales better than a global-queue approach. For thread-level workers, we further introduce Execution-Path-Aware Queueing (EPAQ), which allows programmers to partition task queues using user-defined criteria, reducing the warp divergence caused by mixing heterogeneous control flows within a warp. Across representative irregular applications, GTaP outperforms OpenMP task-parallel execution on a 72-core CPU in many cases, especially for large problem sizes with compute-intensive tasks. We also show that GTaP's design choices outperform naive GPU alternatives. The benefit of EPAQ is workload-dependent: it improves performance for some benchmarks while having little effect on others; on Fibonacci, EPAQ achieves up to a 1.8$\times$ speedup.
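To illustrate what "joins as continuations" and "tasks as state machines split into execution segments" can mean in practice, the following is a minimal host-side C++ sketch, not GTaP's actual implementation or API. It runs Fibonacci with a single sequential worker; the names `Task`, `run_fib`, `pending`, and `slot` are hypothetical, and the persistent kernel, per-worker queues, stealing, and EPAQ are all omitted. Each task runs segment 0 up to the fork, suspends without blocking, and is re-enqueued to run segment 1 (the continuation) once its join counter reaches zero.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical task record: one Fibonacci call split into two segments.
struct Task {
    int n = 0;                              // argument of this call
    int state = 0;                          // 0 = segment before the join, 1 = continuation
    int pending = 0;                        // join counter: children still running
    std::uint64_t child_result[2] = {0, 0}; // values delivered by the two children
    int parent = -1;                        // index of the parent task, -1 for the root
    int slot = 0;                           // which child_result entry of the parent to fill
    std::uint64_t result = 0;               // final value of this call
};

// Sequential stand-in for one worker draining its own queue.
std::uint64_t run_fib(int n) {
    std::vector<Task> pool;  // task storage; tasks are addressed by index
    std::deque<int> ready;   // ready queue of task indices

    Task root;
    root.n = n;
    pool.push_back(root);
    ready.push_back(0);

    // Deliver a finished task's result to its parent and fire the join if it was the last child.
    auto complete = [&](int id) {
        int p = pool[id].parent;
        if (p < 0) return;                        // root: nothing to notify
        pool[p].child_result[pool[id].slot] = pool[id].result;
        if (--pool[p].pending == 0) {
            ready.push_back(p);                   // join satisfied: schedule the continuation
        }
    };

    while (!ready.empty()) {
        int id = ready.back();                    // LIFO pop, as on the owner side of a deque
        ready.pop_back();

        if (pool[id].state == 0) {
            int arg = pool[id].n;
            if (arg < 2) {                        // base case: finishes in a single segment
                pool[id].result = static_cast<std::uint64_t>(arg);
                complete(id);
            } else {                              // fork: spawn two children, then suspend
                pool[id].state = 1;
                pool[id].pending = 2;
                for (int s = 0; s < 2; ++s) {
                    Task child;
                    child.n = arg - 1 - s;
                    child.parent = id;
                    child.slot = s;
                    pool.push_back(child);
                    ready.push_back(static_cast<int>(pool.size()) - 1);
                }
                // Segment 0 ends here; the worker is free to run other tasks.
            }
        } else {                                  // segment 1: continuation after the join
            assert(pool[id].pending == 0);
            pool[id].result = pool[id].child_result[0] + pool[id].child_result[1];
            complete(id);
        }
    }
    return pool[0].result;
}
```

Calling `run_fib(10)` returns 55. In a GPU setting the join counter would presumably need an atomic decrement and the ready queue would be per-worker with stealing from the opposite end; both are left out here to keep the control structure visible.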