Structured sparsity is an efficient way to reduce the complexity of modern Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. Structured-sparse ML models are accelerated by sparse systolic tensor arrays. The increasing prevalence of ML in safety-critical systems requires enhancing sparse tensor arrays with online error detection to manage random hardware failures. Algorithm-based fault tolerance has been proposed as a low-cost mechanism for checking the results of computations online against random hardware failures. In this work, we address a key architectural challenge of structured-sparse tensor arrays: how to provide online error checking for a range of structured sparsity levels while maintaining high hardware utilization. Experimental results highlight the minimal hardware overhead incurred by the proposed checking logic, as well as its error detection properties after injecting random hardware faults into sparse tensor arrays that execute layers of the ResNet50 CNN.
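To make the idea of algorithm-based fault tolerance concrete, the following is a minimal sketch of the classic checksum test for a matrix product, written in plain Python. It is not the checking logic proposed in this work: it only illustrates the general principle that the sum of all entries of C = A·B can be predicted from the inputs alone and compared with the sum of the computed outputs. The function names, the example matrices, and the 2:4 sparsity pattern of `A` are assumptions chosen for illustration. Integer arithmetic is used so that the comparison is exact.

```python
def matmul(a, b):
    """Plain product of two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def predicted_checksum(a, b):
    """Sum of all entries of a*b, computed from the inputs only.

    sum_ij C[i][j] = sum_k (sum of column k of a) * (sum of row k of b)
    """
    col_sums_a = [sum(row[k] for row in a) for k in range(len(b))]
    row_sums_b = [sum(row) for row in b]
    return sum(x * y for x, y in zip(col_sums_a, row_sums_b))


def output_checksum(c):
    """Sum of all entries of the computed result."""
    return sum(sum(row) for row in c)


def check(a, b, c):
    """True when the computed result is consistent with the predicted checksum."""
    return predicted_checksum(a, b) == output_checksum(c)


# Illustrative data (assumption): each row of A holds at most two nonzero
# values in a group of four, i.e. a 2:4 structured-sparse pattern.
A = [[1, 0, 0, 2],
     [0, 3, 4, 0]]
B = [[1, 2],
     [3, 4],
     [5, 6],
     [7, 8]]

C = matmul(A, B)

# Emulate a random hardware fault by flipping one bit of one output value.
C_faulty = [row[:] for row in C]
C_faulty[0][1] ^= 1 << 3

print(check(A, B, C))         # fault-free result passes the check
print(check(A, B, C_faulty))  # corrupted result fails the check
```

A single scalar checksum like this detects any fault that changes the total sum of the outputs; faults whose effects cancel each other in the sum would go unnoticed, which is one reason practical schemes often keep per-row or per-column checksums instead.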