Modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects. Communication libraries that support efficient data transfers to and from GPU memory buffers typically require the CPU to orchestrate the transfer operations. This work explores a new offload-friendly communication strategy, stream-triggered (ST) communication, which allows the synchronization and data movement operations to be offloaded from the CPU to the GPU. An implementation based on Message Passing Interface (MPI) one-sided active target synchronization serves as an exemplar of the proposed strategy, and a latency-sensitive nearest-neighbor microbenchmark is used to examine its performance. The offloaded implementation shows significant on-node performance advantages over standard MPI active RMA (36%) and point-to-point (61%) communication. The current multi-node improvement is smaller (23% faster than standard active RMA but 11% slower than point-to-point), and work is planned to pursue further improvements.