Collective operations are cornerstones of both HPC applications and large-scale AI training and inference, yet benchmarking them systematically and reproducibly remains difficult on modern systems because of the complexity of their hardware and software stacks. Existing suites primarily report end-to-end timings and offer limited support for controlled algorithm and configuration selection, fine-grained profiling, and capture of the runtime environment. We present PICO (Performance Insights for Collective Operations), an open-source framework that decouples portable experiment setup from platform execution, provides a backend-adaptive parameter selection interface across MPI and NCCL, supplies optionally instrumentable plain-MPI reference implementations of collectives, and records the system configuration for reproducible comparisons. Evaluated on three major supercomputers, PICO shows that default collective algorithms and transport settings can be up to $5\times$ slower than the best available choice. It provides diagnostic evidence by isolating topology-sensitive algorithmic choices and, through instrumentation, reveals detailed algorithmic breakdowns. To assess the end-to-end, application-level impact of benchmark-informed tuning, we replay open-source LLM training traces in the ATLAHS simulator with the optimized collective profiles identified by PICO, achieving reductions in training time of up to $44\%$.