This paper presents efforts to improve the hierarchical parallelism of a two-scale simulation code. Two methods for improving GPU parallel performance were developed and compared. The first used the NVIDIA Multi-Process Service, and the second moved the entire sub-problem loop into a single kernel using Kokkos hierarchical parallelism and a PackedView data structure. Both approaches improved parallel performance, with the second method providing the greater improvement.