ApproxABFT: Approximate Algorithm-Based Fault Tolerance for Vision Transformers

Vision Transformers (ViTs), with their outstanding performance, have become a popular backbone of deep learning models for mainstream vision tasks including classification, object detection, and segmentation. Beyond performance, reliability is also a critical metric for adopting ViTs in safety-critical applications such as autonomous driving and robotics. Observing that the major computing blocks in ViTs, such as multi-head attention and feed-forward networks, are usually implemented with general matrix multiplication (GEMM), we propose to apply the classical algorithm-based fault tolerance (ABFT) strategy, originally developed for GEMM, to protect ViTs against soft errors in the underlying computing engines. Unlike classical ABFT, which invokes the expensive error recovery procedure whenever a computing error is detected, we leverage the inherent fault tolerance of ViTs and propose an approximate ABFT, namely ApproxABFT, that invokes the error recovery procedure only when the computing errors are significant enough. This skips many unnecessary error recovery procedures and simplifies the overall GEMM error recovery. ApproxABFT also relaxes the error threshold within the error recovery procedure and ignores minor computing errors, which reduces the error recovery complexity and improves the error recovery quality. In addition, we apply a fine-grained blocking strategy to ApproxABFT and split GEMMs of distinct sizes into smaller sub-blocks, which smooths the error thresholds across ViTs and further improves the error recovery quality. According to our experiments, ApproxABFT reduces the computing overhead by 25.92% to 81.62% and improves the model accuracy by 2.63% to 72.56% compared to the baseline ABFT, while the blocking optimization further reduces the computing overhead by 6.56% to 73.5% with comparable accuracy.
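
As a rough illustration of the detect-then-conditionally-recover idea described above, the following is a minimal sketch of checksum-based ABFT for a single GEMM with a tolerance on the checksum residuals. It is not the authors' implementation. The function names, the status strings, the `threshold` parameter, and the single-element correction rule (the textbook row/column checksum scheme) are all assumptions made for illustration; the abstract does not specify how thresholds are chosen or how recovery is performed.

```python
def matmul(a, b):
    """Plain GEMM on lists of lists (stand-in for the protected compute engine)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def checksum_residuals(a, b, c):
    """Return (row_residuals, col_residuals) of c against the ABFT checksums.

    For a fault-free C = A @ B, row i of C sums to A[i] . (B @ 1) and
    column j of C sums to (1^T @ A) . B[:, j].
    """
    b_row_sums = [sum(row) for row in b]        # B @ 1
    a_col_sums = [sum(col) for col in zip(*a)]  # 1^T @ A
    expected_rows = [sum(x * s for x, s in zip(row, b_row_sums)) for row in a]
    expected_cols = [sum(s * y for s, y in zip(a_col_sums, col)) for col in zip(*b)]
    row_res = [sum(row) - e for row, e in zip(c, expected_rows)]
    col_res = [sum(col) - e for col, e in zip(zip(*c), expected_cols)]
    return row_res, col_res


def approx_abft_check(a, b, c, threshold):
    """Hypothetical approximate check: recover only if a residual exceeds threshold.

    Returns "skipped" when every residual is within the tolerance,
    "corrected" when exactly one row and one column are flagged (c is
    patched in place), and "recompute" for any other error pattern.
    """
    row_res, col_res = checksum_residuals(a, b, c)
    bad_rows = [i for i, r in enumerate(row_res) if abs(r) > threshold]
    bad_cols = [j for j, r in enumerate(col_res) if abs(r) > threshold]
    if not bad_rows and not bad_cols:
        return "skipped"  # minor errors are tolerated, no recovery invoked
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        c[i][j] -= row_res[i]  # a single faulty element shifts its row sum by its error
        return "corrected"
    return "recompute"  # pattern not locatable by this simple scheme
```

With `threshold = 0` and exact integer arithmetic, this sketch reduces to a classical check that reacts to any nonzero residual. With a positive threshold, a small perturbation of one output element is left in place while a large one is located and corrected. In floating point, rounding alone produces small nonzero residuals, so some tolerance is needed in practice.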
