From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

Daemyung Kang, Eunjin Hwang, Hanjeong Lee, HyeokJin Kim, Hyunhoi Koo, Jeongkyu Shin, Jeongseok Kang, Jihyun Kang, Joongi Kim, Junbum Lee, Jungseung Yang, Kyujin Cho, Youngsook Song

arXiv preprint; 42 pages, 19 figures, 16 tables. Lablup Technical Report.

Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions rather than rare exceptions. Public operational evidence from production training clusters, however, remains scarce. This technical report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The cluster operates within a cross-organizational environment in which five parties (SKT, Upstage, Lablup, NVIDIA Korea, and VAST Data) share a unified monitoring pipeline. This arrangement enabled joint diagnosis of a 60-node-scale storage I/O bottleneck that did not appear at 2-4-node scale, a production-scale phenomenon no single team could isolate alone. Drawing on a months-long pre-training campaign, we perform three quantitative analyses yielding four findings. First, statistical analysis over 751 Prometheus metrics and 10 XID-identified GPU failures achieves a 10/10 detection rate (2/10 pre-XID) at ~0.84 false positives per day. No single metric is consistently dominant across failure types, motivating a multi-signal detection strategy. Second, profiling 523 checkpoint events along the GPU VRAM to NFS path attributes the "bandwidth paradox" (1.4-10.4% utilization of 200 Gbps RoCE) to saturation of the 128-slot NFS RPC layer. Third, multi-node failure response shows concentrated exclusions (top 3 of 63 nodes account for >50% of all exclusions) and an auto-retry chain success rate of 33.3% over 12 chains (73 attempts), 2.7x the 12.5% manual recovery rate; the median retry interval is 11 min (IQR 10-11). All analyses are grounded in production infrastructure providing session-level workload management, GPU-centric scheduling, and unified observability.
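The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a reader's sanity check, not part of the report's methodology. It assumes that 200 Gbps is the raw line rate of a single RoCE link and ignores protocol overhead. The constant and function names are hypothetical, and the count of successful retry chains is derived from the stated percentage rather than quoted from the report.

```python
# Sanity-check arithmetic on figures quoted in the abstract.
# All names here are illustrative; only the numeric inputs come from the abstract.

NODES = 63
GPUS = 504
LINK_GBPS = 200.0                 # assumed: raw line rate of one RoCE link
UTIL_RANGE_PCT = (1.4, 10.4)      # reported utilization range during checkpoints
AUTO_RETRY_SUCCESS_PCT = 33.3     # reported auto-retry chain success rate
MANUAL_SUCCESS_PCT = 12.5         # reported manual recovery rate
RETRY_CHAINS = 12
RETRY_ATTEMPTS = 73


def utilization_to_gbps(util_pct: float, link_gbps: float = LINK_GBPS) -> float:
    """Convert a utilization percentage into achieved throughput in Gbps."""
    return link_gbps * util_pct / 100.0


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


gpus_per_node = GPUS // NODES
low_gbps, high_gbps = (utilization_to_gbps(u) for u in UTIL_RANGE_PCT)
success_ratio = AUTO_RETRY_SUCCESS_PCT / MANUAL_SUCCESS_PCT

# Derived, not stated in the abstract: 33.3% of 12 chains rounds to 4 chains.
successful_chains = round(RETRY_CHAINS * AUTO_RETRY_SUCCESS_PCT / 100.0)
mean_attempts_per_chain = RETRY_ATTEMPTS / RETRY_CHAINS

if __name__ == "__main__":
    print(f"GPUs per node: {gpus_per_node}")
    print(f"Achieved throughput: {low_gbps:.1f} to {high_gbps:.1f} Gbps "
          f"({gbps_to_gigabytes_per_s(low_gbps):.2f} to "
          f"{gbps_to_gigabytes_per_s(high_gbps):.2f} GB/s)")
    print(f"Auto-retry vs manual success ratio: {success_ratio:.2f}x")
    print(f"Successful chains (derived): {successful_chains} of {RETRY_CHAINS}")
    print(f"Mean attempts per chain: {mean_attempts_per_chain:.2f}")
```

Under these assumptions, the reported utilization corresponds to only a few gigabytes per second per link, which is consistent with the abstract's argument that the bottleneck sits in the NFS RPC layer rather than in the network fabric.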
