From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

Daemyung Kang, Eunjin Hwang, Hanjeong Lee, HyeokJin Kim, Hyunhoi Koo, Jeongkyu Shin, Jeongseok Kang, Jihyun Kang, Jinho Heo, Joongi Kim, Junbum Lee, Jungseung Yang, Kyujin Cho, Youngsook Song

From arXiv. 42 pages, 19 figures, 16 tables. Lablup Technical Report.

Large-scale AI training is fundamentally a distributed systems problem in which hardware failures are routine operating conditions rather than rare exceptions, yet public operational evidence from production training clusters remains limited. This report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The environment is cross-organizational: five parties (SKT, Upstage, Lablup, NVIDIA Korea, VAST Data) share a unified monitoring pipeline. This enabled joint diagnosis of a 60-node-scale storage I/O bottleneck that was absent in 2- to 4-node tests, a production-scale phenomenon no single team could isolate alone. We perform three quantitative analyses that yield four findings. First, across 751 Prometheus metrics and 10 XID-identified GPU failures, no single metric is consistently dominant across failure types, which motivates multi-signal detection. Second, 523 checkpoint events trace the save/load path from GPU VRAM to the NFS server: restart loading reaches 21.5% of maximum read bandwidth (700 GB/s) and save bursts reach 16.0% of maximum write bandwidth (250 GB/s), with NFS/RPC queueing and transport-layer backlog rising together. Third, across 224 sessions over 73 days, node exclusions are concentrated: the top 3 of 63 nodes account for over 50% of them. Fourth, auto-retry chain analysis shows a 33.3% success rate over 12 chains (73 attempts), 2.7x the 12.5% manual rate, with a median retry interval of 11 minutes (IQR 10-11). All analyses are grounded in production infrastructure that provides session-level workload management, GPU-centric scheduling, and unified observability.
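The headline percentages can be turned into absolute quantities with simple arithmetic. The Python sketch below does this using only the figures quoted in the abstract. The derived values (implied throughput in GB/s, GPUs per node, number of successful chains, attempts per chain) are back-of-the-envelope estimates and are not numbers the report itself states. The function and variable names are illustrative.

```python
# Back-of-the-envelope arithmetic on figures quoted in the abstract.
# All derived values are estimates, not numbers reported by the authors.

NODES = 63
GPUS = 504

MAX_READ_GBPS = 700.0    # maximum read bandwidth quoted in the abstract
MAX_WRITE_GBPS = 250.0   # maximum write bandwidth quoted in the abstract
READ_UTILIZATION = 0.215   # restart loading, fraction of max read
WRITE_UTILIZATION = 0.160  # save bursts, fraction of max write

AUTO_RETRY_SUCCESS = 0.333  # success rate over auto-retry chains
MANUAL_SUCCESS = 0.125      # manual retry success rate
RETRY_CHAINS = 12
RETRY_ATTEMPTS = 73


def implied_throughput(max_gbps: float, utilization: float) -> float:
    """Absolute throughput implied by a utilization fraction (assumed helper)."""
    return max_gbps * utilization


gpus_per_node = GPUS // NODES
read_gbps = implied_throughput(MAX_READ_GBPS, READ_UTILIZATION)
write_gbps = implied_throughput(MAX_WRITE_GBPS, WRITE_UTILIZATION)

# Ratio of auto-retry to manual success rate (the abstract rounds this to 2.7x).
success_ratio = AUTO_RETRY_SUCCESS / MANUAL_SUCCESS

# 33.3% of 12 chains corresponds to a whole number of successful chains.
successful_chains = round(AUTO_RETRY_SUCCESS * RETRY_CHAINS)
attempts_per_chain = RETRY_ATTEMPTS / RETRY_CHAINS

if __name__ == "__main__":
    print(f"GPUs per node:            {gpus_per_node}")
    print(f"Implied restart read:     {read_gbps:.1f} GB/s")
    print(f"Implied save-burst write: {write_gbps:.1f} GB/s")
    print(f"Auto vs manual success:   {success_ratio:.2f}x")
    print(f"Successful auto chains:   {successful_chains} of {RETRY_CHAINS}")
    print(f"Mean attempts per chain:  {attempts_per_chain:.1f}")
```

Read this way, both checkpoint paths run well below the quoted bandwidth ceilings even during bursts. That fits the abstract's emphasis on NFS/RPC queueing and transport-layer backlog rather than on raw bandwidth.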
