Many-core accelerators are essential for high-performance deep learning, but their performance is undermined by widespread fail-slow failures. Detecting such failures on-chip is challenging, as prior methods from distributed systems are unsuitable due to strict memory limits and their inability to track failures across the hardware topology. This paper introduces SLOTH, a lightweight, hardware-aware framework for practical on-chip fail-slow detection in many-core accelerators. SLOTH combines workload-aware instrumentation for operator-level monitoring with minimal overhead, on-the-fly trace compression to operate within kilobytes of memory, and a novel topology-aware ranking algorithm to pinpoint a failure's root cause. We evaluate SLOTH on a wide range of representative DNN workloads. The results demonstrate that SLOTH reduces the storage overhead by an average of 115.9$\times$, while achieving an average fail-slow detection accuracy of 86.77% and a false positive rate (FPR) of 12.11%. More importantly, SLOTH scales effectively across different many-core accelerator architectures, making it practical for large-scale deployments.