GPU nodes are central to modern HPC and AI workloads, yet many failures do not manifest as immediate hard faults. While some instabilities emerge gradually as weak thermal or efficiency drift, a significant class occurs abruptly with little or no numeric precursor. In these detachment-class failures, GPUs become unavailable at the driver or interconnect level and the dominant observable signal is structural, including disappearance of device metrics and degradation of monitoring payload integrity. This paper proposes an observability-aware early-warning framework that jointly models (i) utilization-aware thermal drift signatures in GPU telemetry and (ii) monitoring-pipeline degradation indicators such as scrape latency increase, sample loss, time-series gaps, and device-metric disappearance. The framework is evaluated on production telemetry from GPU nodes at GWDG, where GPU, node, monitoring, and scheduler signals can be correlated. Results show that detachment failures exhibit minimal numeric precursor and are primarily observable through structural telemetry collapse, while joint modeling increases early-warning lead time compared to GPU-only detection. The dataset used in this study is publicly available at https://doi.org/10.5281/zenodo.19052367.
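To make the structural indicators in (ii) concrete, the following is a minimal sketch of how scrape-latency increase, sample loss, time-series gaps, and device-metric disappearance might be derived from a window of scrape records. It is an illustration under stated assumptions, not the framework's implementation: the `Scrape` record layout, the function name `structural_indicators`, and the `gap_factor` tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scrape:
    """One monitoring scrape of a node (hypothetical record layout)."""
    timestamp: float      # seconds, time at which the scrape completed
    duration: float       # seconds the scrape took (scrape latency)
    gpu_ids: frozenset    # GPUs that reported metrics in this scrape


def structural_indicators(scrapes, interval, expected_gpus, gap_factor=1.5):
    """Summarise monitoring-pipeline degradation over a window of scrapes.

    `interval` is the nominal scrape period in seconds. `gap_factor` is an
    assumed tolerance: a spacing larger than gap_factor * interval counts
    as a time-series gap.
    """
    if not scrapes:
        raise ValueError("need at least one scrape")
    scrapes = sorted(scrapes, key=lambda s: s.timestamp)

    # Time-series gaps: consecutive scrapes spaced further apart than tolerated.
    spacings = [b.timestamp - a.timestamp for a, b in zip(scrapes, scrapes[1:])]
    gaps = sum(1 for d in spacings if d > gap_factor * interval)

    # Sample loss: fraction of the nominally expected samples that are absent.
    span = scrapes[-1].timestamp - scrapes[0].timestamp
    expected_samples = round(span / interval) + 1
    sample_loss = max(0.0, 1.0 - len(scrapes) / expected_samples)

    # Scrape-latency increase: mean duration of the late half minus the early half.
    half = max(1, len(scrapes) // 2)
    early = sum(s.duration for s in scrapes[:half]) / half
    late = sum(s.duration for s in scrapes[-half:]) / half

    # Device-metric disappearance: expected GPUs absent from the latest scrape.
    missing = frozenset(expected_gpus) - scrapes[-1].gpu_ids

    return {
        "gaps": gaps,
        "sample_loss": sample_loss,
        "latency_increase_s": late - early,
        "missing_gpus": missing,
    }
```

A detector in this spirit could combine these structural values with the thermal-drift features in (i), so that a node whose GPU metrics vanish still produces a signal even when no numeric precursor is present.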