Current supercomputers use a heterogeneous mix of accelerator cards, ranging from NVIDIA GPUs to AMD GPUs and Intel GPUs, which makes portability a key aspect of modern HPC applications. In Octo-Tiger, we rely on Kokkos and its various execution spaces for portable compute kernels. In turn, we use HPX to coordinate kernel launches, CPU tasks, and communication. This combination allows fine-grained interleaving of portable CPU/GPU computations and communication, enabling scalability on various supercomputers. However, for HPX and Kokkos to work together optimally, we need to be able to treat Kokkos kernels as HPX tasks. Otherwise, instead of integrating asynchronous Kokkos kernel launches into HPX's task graph, we would have to actively wait for them with fence commands, wasting CPU time that could be spent on other work. With an integration layer called HPX-Kokkos, treating Kokkos kernels as tasks already works for some Kokkos execution spaces (such as the CUDA one), but not for others (such as the SYCL one). In this work, we started making Octo-Tiger and HPX itself compatible with SYCL. To do so, we introduce numerous software changes, most notably an HPX-SYCL integration. This integration allows us to treat SYCL events as HPX tasks, which in turn lets us extend HPX-Kokkos to fully support Kokkos' SYCL execution space. We show two ways to implement this HPX-SYCL integration and test them using Octo-Tiger and its Kokkos kernels on both an NVIDIA A100 and an AMD MI100. We find modest, yet noticeable, speedups from enabling this integration, even when running simple single-node scenarios with Octo-Tiger where communication and CPU utilization are not yet an issue.
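To make the idea of treating an asynchronous device event as a task more concrete, the following is a minimal sketch in standard C++. It does not use the HPX, Kokkos, or SYCL APIs, and it is not the implementation described in this work: `MockEvent`, `PollingScheduler`, `then`, `poll`, and `on_complete` are hypothetical names introduced only for illustration. The sketch shows two generic ways to run a continuation once an event finishes without blocking on a fence: polling pending events from the scheduler, and registering a completion callback on the event. Whether these correspond to the two integration variants evaluated here is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device event (what a kernel launch might return).
// This is NOT the SYCL or HPX API; completion is triggered by hand.
struct MockEvent {
    bool done = false;
    std::vector<std::function<void()>> callbacks;

    // Callback variant: run cb as soon as the event has completed.
    void on_complete(std::function<void()> cb) {
        if (done) {
            cb();
        } else {
            callbacks.push_back(std::move(cb));
        }
    }

    // Simulates the device finishing the kernel.
    void complete() {
        done = true;
        for (auto& cb : callbacks) {
            cb();
        }
        callbacks.clear();
    }
};

// Polling variant: the scheduler checks pending events between tasks and runs
// the continuation of every event that has finished. Nothing blocks.
class PollingScheduler {
public:
    // Attach a continuation to an event instead of waiting on a fence.
    void then(std::shared_ptr<MockEvent> ev, std::function<void()> continuation) {
        assert(ev && continuation);
        pending_.push_back({std::move(ev), std::move(continuation)});
    }

    // Meant to be called periodically; returns how many continuations ran.
    std::size_t poll() {
        std::size_t ran = 0;
        for (std::size_t i = 0; i < pending_.size();) {
            if (pending_[i].event->done) {
                auto continuation = std::move(pending_[i].continuation);
                pending_.erase(pending_.begin() + static_cast<std::ptrdiff_t>(i));
                continuation();
                ++ran;
            } else {
                ++i;  // still running on the device, check again later
            }
        }
        return ran;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    struct Entry {
        std::shared_ptr<MockEvent> event;
        std::function<void()> continuation;
    };
    std::vector<Entry> pending_;
};
```

In both variants the launching thread stays free for other CPU tasks or communication while the kernel runs. The polling variant adds a small recurring check to the scheduler loop, whereas the callback variant relies on the device runtime to notify the host.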