Process mapping asks for an assignment of the vertices of a task graph to the processing elements of a supercomputer such that the computational workload is balanced and the communication cost is minimized. Motivated by the recent success of GPU-based graph partitioners, we propose two GPU-accelerated algorithms for this optimization problem. The first algorithm employs hierarchical multisection, which partitions the task graph along the hierarchy of the supercomputer, and uses GPU-based graph partitioners to accelerate the mapping process. The second algorithm integrates process mapping directly into the modern multilevel graph partitioning pipeline and accelerates vital phases such as coarsening and refinement by exploiting the parallelism of GPUs. The first algorithm has, on average, about 12 percent higher communication costs than the state-of-the-art solver and thus remains competitive with it in solution quality. In terms of speed, however, it vastly outperforms this competitor, with a geometric mean speedup of 22 times and a maximum speedup of 934 times. The second approach is even faster, with a geometric mean speedup of 1454 times and a peak speedup of 12376 times. Compared to other algorithms that prioritize speed over solution quality, this approach achieves the same quality at much greater speedups. To our knowledge, these are the first GPU-based algorithms for process mapping.
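To make the objective concrete, the following is a minimal sketch of how the communication cost of a given mapping could be evaluated on a hierarchical machine. It is not the implementation of either proposed algorithm. The hierarchy encoding (a list of group sizes, lowest level first), the per-level distances, and the names `pe_distance`, `communication_cost`, and `is_balanced` are all assumptions introduced for illustration.

```python
import math
from typing import List, Tuple

Edge = Tuple[int, int, int]  # (task u, task v, communication volume)


def pe_distance(p: int, q: int, hierarchy: List[int], distances: List[int]) -> int:
    """Distance between processing elements p and q (illustrative model).

    hierarchy[i] is the assumed number of children per group at level i,
    lowest level first; distances[i] is the assumed cost between two PEs
    whose first common group is at level i.
    """
    if p == q:
        return 0
    for size, dist in zip(hierarchy, distances):
        # Move both PEs up one level and check whether they now share a group.
        p //= size
        q //= size
        if p == q:
            return dist
    raise ValueError("PE index outside the described hierarchy")


def communication_cost(edges: List[Edge], mapping: List[int],
                       hierarchy: List[int], distances: List[int]) -> int:
    """Sum of communication volume times PE distance over all task-graph edges."""
    return sum(w * pe_distance(mapping[u], mapping[v], hierarchy, distances)
               for u, v, w in edges)


def is_balanced(weights: List[int], mapping: List[int], k: int, eps: float) -> bool:
    """Check that no PE exceeds (1 + eps) times the ceiling of the average load."""
    load = [0] * k
    for task, pe in enumerate(mapping):
        load[pe] += weights[task]
    limit = (1 + eps) * math.ceil(sum(weights) / k)
    return max(load) <= limit


# Toy instance: 4 tasks on 4 PEs, 2 PEs per processor, 2 processors per node.
hierarchy = [2, 2]
distances = [1, 10]
edges = [(0, 1, 5), (1, 2, 1), (2, 3, 5), (0, 3, 1)]
weights = [1, 1, 1, 1]

good = [0, 1, 2, 3]  # heavy edges stay inside a processor
bad = [0, 2, 1, 3]   # heavy edges cross processors

cost_good = communication_cost(edges, good, hierarchy, distances)
cost_bad = communication_cost(edges, bad, hierarchy, distances)
balanced = is_balanced(weights, good, k=4, eps=0.03)
```

In this toy instance both mappings are perfectly balanced, but placing the heavily communicating task pairs on the same processor yields a cost of 30 instead of 120, which is the kind of difference a process mapping algorithm tries to exploit.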