Transformer-based deep neural networks (DNNs) affected by backdoor attacks or unfairness typically exhibit anomalous attention patterns, causing them to over-attend to backdoor triggers or protected attributes. Existing neuron-editing mitigation strategies often struggle to handle such situations; most lack flexibility and tend to distort feature representations. Motivated by this over-attention phenomenon and by software engineering paradigms such as delta debugging and hot patching, we propose AtPatch, a hot-fix method that dynamically redistributes attention maps during model inference. Specifically, for a given input, AtPatch first extracts the attention map from the model's inference process. It then uses a pre-trained detector to identify anomalous columns and replaces them with uniform benign attention. Next, AtPatch rescales the remaining columns to mitigate the impact of over-attention. Finally, AtPatch returns the redistributed attention map to the model for continued inference. Notably, if the detector reports no anomalous columns, AtPatch returns the original attention map to the model unchanged. Unlike existing techniques, AtPatch redistributes the attention map selectively, which better preserves the model's original functionality. Furthermore, AtPatch's on-the-fly nature allows it to work without modifying model parameters or retraining, making it better suited to deployed models. We conducted extensive experiments to validate AtPatch. The results show that, compared with existing methods, AtPatch mitigates backdoor attacks and unfairness more effectively while better preserving the model's original functionality.
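The redistribution step can be sketched as follows. This is a minimal illustrative simplification, not the authors' implementation: it assumes the detector has already flagged anomalous columns, takes "uniform benign attention" to mean the value 1/n for a length-n row, and rescales the remaining columns so each attention row still sums to one. The function name and signature are hypothetical.

```python
import numpy as np

def redistribute_attention(attn: np.ndarray, anomalous_cols) -> np.ndarray:
    """Hypothetical sketch of attention redistribution.

    attn: an (m, n) attention map whose rows sum to 1.
    anomalous_cols: column indices flagged by a detector (assumed given).
    """
    attn = attn.copy()
    n = attn.shape[-1]
    mask = np.zeros(n, dtype=bool)
    mask[list(anomalous_cols)] = True
    if not mask.any():
        return attn  # no anomalies reported: return the map unchanged

    # Replace each anomalous column with a uniform benign value (assumption: 1/n).
    benign = 1.0 / n
    attn[:, mask] = benign

    # Rescale the remaining columns so every row sums to 1 again.
    remaining_mass = 1.0 - benign * mask.sum()
    row_sums = attn[:, ~mask].sum(axis=1, keepdims=True)
    attn[:, ~mask] *= remaining_mass / row_sums
    return attn
```

In a deployed setting this transformation would run inside a forward hook on each attention layer, so the model's parameters stay untouched and the patched map simply replaces the original one for the rest of inference.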