Unbalanced optimal transport (UOT) is a fundamental tool in many application domains, where it often dominates application running time. Although many optimizations for UOT have been proposed, few approach the problem from a computer architecture perspective. In this paper, we first study the performance bottlenecks of UOT through a series of experiments, which reveal that UOT is heavily memory-bound. Guided by these findings, we propose MAP-UOT, a Memory-efficient APproach to the implementation and optimization of UOT on CPU and GPU platforms. Our experimental evaluations show that MAP-UOT consistently and significantly outperforms state-of-the-art (SOTA) implementations. Specifically, its single-threaded performance improves over POT/COFFEE by up to 2.9X/2.4X, with an average of 1.9X/1.6X. Its parallelized performance improves over POT/COFFEE by up to 2.4X/1.9X, with an average of 2.2X/1.8X, on an Intel Core i9-12900K, and over POT by up to 3.5X, with an average of 1.6X, on an Nvidia GeForce RTX 3090 Ti. MAP-UOT also shows substantial performance improvement on the Tianhe-1 supercomputer.