A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs

Heterogeneous nodes that combine multi-core CPUs with diverse accelerators are rapidly becoming the norm in both high-performance computing (HPC) and AI infrastructures. Exploiting these platforms, however, requires orchestrating several low-level accelerator APIs such as CUDA, SYCL, and Triton. In some cases these are combined with optimized vendor math libraries such as cuBLAS and oneAPI. Each API or library introduces its own abstractions, execution semantics, and synchronization mechanisms, so combining them within a single application is error-prone and labor-intensive. We propose reusing a task-based data-flow methodology together with Task-Aware APIs (TA-libs) to overcome these limitations and to integrate multiple accelerator programming models seamlessly, while still leveraging the best-in-class kernels offered by each API. Applications are expressed as a directed acyclic graph (DAG) of host tasks and device kernels managed by an OpenMP/OmpSs-2 runtime. We introduce Task-Aware SYCL (TASYCL) and leverage Task-Aware CUDA (TACUDA), both of which elevate individual accelerator invocations to first-class tasks. When multiple native runtimes coexist on the same multi-core CPU, they contend for threads, leading to oversubscription and performance variability. To address this, we unify their thread management under the nOS-V tasking and threading library, to which we contribute a new port of the PoCL (Portable OpenCL) runtime. Our results demonstrate that task-aware libraries, coupled with the nOS-V library, enable a single application to harness multiple accelerator programming models transparently and efficiently. The proposed methodology is immediately applicable to current heterogeneous nodes and is readily extensible to future systems that integrate even richer combinations of CPUs, GPUs, FPGAs, and AI accelerators.
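
To make the data-flow idea concrete, the following is a minimal sketch of how a small DAG can be expressed with standard OpenMP task dependencies. It is an illustration only, not the TACUDA or TASYCL interface: the names `scale_kernel` and `run_pipeline` are hypothetical, and the "device kernel" is a host function standing in for a real accelerator launch so that the example stays self-contained.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a device kernel (for example a CUDA or SYCL
// launch). It runs on the host here; this is an assumption of the sketch.
static void scale_kernel(const double* in, double* out, std::size_t n,
                         double factor) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * factor;
    }
}

// Builds a three-node DAG: produce -> scale -> reduce.
// The runtime derives the edges from the depend clauses; the code never
// synchronizes the tasks with each other explicitly.
double run_pipeline(std::size_t n) {
    std::vector<double> a(n), b(n);
    double* pa = a.data();
    double* pb = b.data();
    double sum = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        // Host task: writes the input buffer.
        #pragma omp task depend(out: pa[0:n])
        {
            for (std::size_t i = 0; i < n; ++i) {
                pa[i] = static_cast<double>(i + 1);
            }
        }

        // "Kernel" task: reads pa, writes pb, so it waits for the producer.
        #pragma omp task depend(in: pa[0:n]) depend(out: pb[0:n])
        {
            scale_kernel(pa, pb, n, 2.0);
        }

        // Host task: reads pb, so it waits for the kernel task.
        #pragma omp task depend(in: pb[0:n]) depend(out: sum)
        {
            sum = std::accumulate(pb, pb + n, 0.0);
        }

        // Wait for all three tasks before leaving the region.
        #pragma omp taskwait
    }
    return sum;
}
```

Calling `run_pipeline(4)` fills the input with 1, 2, 3, 4, doubles it, and returns the sum of the doubled values. Compiling with an OpenMP flag such as `-fopenmp` lets the runtime schedule the tasks; without it the pragmas are ignored and the three steps simply run in program order, which yields the same result.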
