From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

Daemyung Kang, Eunjin Hwang, Hanjeong Lee, HyeokJin Kim, Hyunhoi Koo, Jeongkyu Shin, Jeongseok Kang, Jihyun Kang, Jinho Heo, Joongi Kim, Junbum Lee, Jungseung Yang, Kyujin Cho, Youngsook Song

From arXiv. 42 pages, 19 figures, 16 tables. Lablup Technical Report.

Large-scale AI training is fundamentally a distributed systems problem in which hardware failures are routine operating conditions rather than rare exceptions, yet public operational evidence from production training clusters remains limited. This report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The environment is cross-organizational: five parties (SKT, Upstage, Lablup, NVIDIA Korea, VAST Data) share a unified monitoring pipeline. This shared pipeline enabled joint diagnosis of a storage I/O bottleneck that appeared at 60-node scale but was absent in 2- to 4-node tests, a production-scale phenomenon that no single team could have isolated alone. We perform three quantitative analyses that yield four findings. First, across 751 Prometheus metrics and 10 XID-identified GPU failures, no single metric is consistently dominant across failure types, which motivates multi-signal detection. Second, 523 checkpoint events trace the save/load path from GPU VRAM to the NFS server: restart loading reaches 21.5% of the maximum read bandwidth (700 GB/s) and save bursts reach 16.0% of the maximum write bandwidth (250 GB/s), with NFS/RPC queueing and transport-layer backlog rising together. Third, across 224 sessions over 73 days, node exclusions are concentrated: the top 3 of 63 nodes account for over 50% of exclusions. Fourth, auto-retry chain analysis shows a 33.3% success rate over 12 chains (73 attempts), 2.7x the 12.5% manual rate, with a median retry interval of 11 minutes (IQR 10-11). All analyses are grounded in production infrastructure that provides session-level workload management, GPU-centric scheduling, and unified observability.
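As a rough sanity check on the figures quoted in the abstract, the percentages can be converted to absolute values with simple arithmetic. The sketch below is illustrative only and is not taken from the report: it assumes the utilization percentages are relative to the parenthesized maxima (700 GB/s read, 250 GB/s write), and the variable names are hypothetical. The absolute throughputs and the count of successful chains are inferred here, not stated in the abstract.

```python
# Figures quoted in the abstract (units: GB/s and fractions).
MAX_READ_GB_PER_S = 700.0    # maximum read bandwidth
MAX_WRITE_GB_PER_S = 250.0   # maximum write bandwidth
RESTART_READ_UTIL = 0.215    # restart loading, fraction of max read
SAVE_WRITE_UTIL = 0.160      # save bursts, fraction of max write

AUTO_RETRY_SUCCESS = 0.333   # auto-retry chain success rate
MANUAL_SUCCESS = 0.125       # manual retry success rate
AUTO_RETRY_CHAINS = 12       # number of auto-retry chains analyzed


def absolute_throughput(max_gb_per_s: float, utilization: float) -> float:
    """Convert a fractional utilization into absolute throughput (GB/s)."""
    return max_gb_per_s * utilization


# Inferred values (assumption: percentages are relative to the stated maxima).
restart_read = absolute_throughput(MAX_READ_GB_PER_S, RESTART_READ_UTIL)
save_write = absolute_throughput(MAX_WRITE_GB_PER_S, SAVE_WRITE_UTIL)

# Ratio of auto-retry to manual success rates; the abstract rounds this to 2.7x.
success_ratio = AUTO_RETRY_SUCCESS / MANUAL_SUCCESS

# Inferred count of successful chains (33.3% of 12), not stated in the abstract.
successful_chains = round(AUTO_RETRY_SUCCESS * AUTO_RETRY_CHAINS)

print(f"restart read throughput: {restart_read:.1f} GB/s")
print(f"save write throughput:   {save_write:.1f} GB/s")
print(f"auto/manual success ratio: {success_ratio:.2f}x")
print(f"successful auto-retry chains (inferred): {successful_chains}")
```

Under these assumptions, both checkpoint paths run well below the stated bandwidth ceilings, which is consistent with the abstract's pointer to NFS/RPC queueing and transport-layer backlog rather than raw bandwidth exhaustion.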
