UCX is a communication framework that enables low-latency, high-bandwidth communication in HPC systems. Its unified API supports efficient data transfers across multi-node CPU-GPU clusters, and it is widely used as the transport layer for MPI, particularly in GPU-aware implementations. However, existing profiling tools lack fine-grained communication traces at the UCX level, do not capture transport-layer behavior, or are limited to specific MPI implementations. To address these gaps, we introduce ucTrace, a novel profiler that exposes and visualizes UCX-driven communication in HPC environments. ucTrace provides insights into MPI workflows by profiling message passing at the UCX level and linking operations between hosts and devices (e.g., GPUs and NICs) directly to their originating MPI functions. Through interactive visualizations of process- and device-specific interactions, ucTrace helps system administrators, library developers, and application developers optimize performance and debug communication patterns in large-scale workloads. We demonstrate ucTrace's features through a range of experiments: MPI point-to-point behavior under different UCX settings, Allreduce comparisons across MPI libraries, communication analysis of a linear solver, NUMA binding effects, and profiling of GPU-accelerated GROMACS MD simulations at scale. ucTrace is publicly available at https://github.com/ParCoreLab/ucTrace.