Performance diagnosis in production-scale AI training is challenging because subtle OS-level issues can trigger cascading GPU delays and network slowdowns, degrading training efficiency across thousands of GPUs. Existing profiling tools are limited to single system layers, incur prohibitive overhead (10--30%), or lack continuous deployment capabilities, resulting in manual analyses spanning days. We argue that continuous, cross-layer observability enabled by OS-level instrumentation and layered differential diagnosis is necessary to address this gap. We introduce SysOM-AI, a production observability system that continuously integrates CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via adaptive hybrid stack unwinding and eBPF-based tracing, incurring less than 0.4% overhead. Deployed at Alibaba across over 80,000 GPUs for more than one year, SysOM-AI helped diagnose 94 confirmed production issues, reducing median diagnosis time from days to approximately 10 minutes.