Improving the Robustness of Transformer-based Large Language Models with Dynamic Attention

Transformer-based models, such as BERT and GPT, have been widely adopted in natural language processing (NLP) due to their exceptional performance. However, recent studies show that they are vulnerable to textual adversarial attacks, in which intentionally manipulated text inputs mislead the model's output. Although various methods have been proposed to enhance the model's robustness and mitigate this vulnerability, many require heavy resource consumption (e.g., adversarial training) or provide only limited protection (e.g., defensive dropout). In this paper, we propose a novel method called dynamic attention, tailored for the transformer architecture, to enhance the inherent robustness of the model itself against various adversarial attacks. Our method requires no downstream task knowledge and does not incur additional costs. The proposed dynamic attention consists of two modules: (i) attention rectification, which masks or weakens the attention value of the chosen tokens, and (ii) dynamic modeling, which dynamically builds the set of candidate tokens. Extensive experiments demonstrate that dynamic attention significantly mitigates the impact of adversarial attacks, performing up to 33% better than previous methods against widely-used adversarial attacks. The model-level design of dynamic attention enables it to be easily combined with other defense methods (e.g., adversarial training) to further enhance the model's robustness. Furthermore, we demonstrate that dynamic attention preserves the state-of-the-art robustness space of the original model compared to other dynamic modeling methods.
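To make the two modules concrete, the following is a minimal sketch of how they could fit together in a single attention head. It is not the paper's implementation: the abstract does not say how candidate tokens are chosen or how strongly they are weakened, so the selection rule (the tokens receiving the most total attention), the random candidate-set size, and the names `choose_candidates`, `rectify`, `dynamic_attention`, `max_fraction`, and `beta` are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def choose_candidates(weights, max_fraction, rng):
    """Dynamic modeling (assumed form): draw a random candidate-set size,
    then pick the tokens that receive the most total attention."""
    n = len(weights[0])
    max_k = max(1, int(max_fraction * n))
    k = rng.randint(1, max_k)  # the set size can differ from call to call
    received = [sum(row[j] for row in weights) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: received[j], reverse=True)
    return set(ranked[:k])


def rectify(weights, candidates, beta):
    """Attention rectification (assumed form): scale the attention paid to
    candidate tokens by beta (0 masks, 0 < beta < 1 weakens), then
    renormalize each row so it sums to 1 again."""
    rectified = []
    for row in weights:
        scaled = [w * beta if j in candidates else w for j, w in enumerate(row)]
        total = sum(scaled)
        if total == 0.0:
            # Every token was masked; fall back to the unmodified row.
            rectified.append(list(row))
        else:
            rectified.append([w / total for w in scaled])
    return rectified


def dynamic_attention(scores, max_fraction=0.3, beta=0.0, seed=None):
    """Apply both hypothetical modules to raw (pre-softmax) attention scores."""
    rng = random.Random(seed)
    weights = [softmax(row) for row in scores]
    candidates = choose_candidates(weights, max_fraction, rng)
    return rectify(weights, candidates, beta), candidates


# Toy 4-token example in which token 0 dominates every query's attention.
scores = [
    [2.0, 0.1, 0.3, 0.2],
    [1.5, 0.2, 0.1, 0.4],
    [1.8, 0.3, 0.2, 0.1],
    [2.2, 0.1, 0.4, 0.3],
]
rectified, candidates = dynamic_attention(scores, max_fraction=0.5, beta=0.0, seed=0)
```

Under these assumptions, an attacker cannot know in advance which tokens will be suppressed, because the candidate-set size is resampled on every forward pass. Setting `beta` between 0 and 1 weakens the chosen tokens instead of masking them outright.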
