Many-core accelerators are essential for high-performance deep learning, but their performance is undermined by widespread fail-slow failures. Detecting such failures on-chip is challenging: prior methods from distributed systems are unsuitable because on-chip memory is strictly limited and those methods cannot track failures across the hardware topology. This paper introduces SLOTH, a lightweight, hardware-aware framework for practical on-chip fail-slow detection in many-core accelerators. SLOTH combines workload-aware instrumentation for operator-level monitoring with minimal overhead, on-the-fly trace compression to operate within kilobytes of memory, and a novel topology-aware ranking algorithm to pinpoint a failure's root cause. We evaluate SLOTH on a wide range of representative DNN workloads. The results demonstrate that SLOTH reduces storage overhead by an average of 115.9$\times$ while achieving an average fail-slow detection accuracy of 86.77% and a false positive rate (FPR) of 12.11%. More importantly, SLOTH scales effectively across different many-core accelerator architectures, making it practical for large-scale deployments.