Adversarial attacks represent a substantial challenge in Natural Language Processing (NLP). This study undertakes a systematic exploration of this challenge in two distinct phases: vulnerability evaluation and resilience enhancement of Transformer-based models under adversarial attacks. In the evaluation phase, we assess the susceptibility of three Transformer configurations (encoder-decoder, encoder-only, and decoder-only) to adversarial attacks of escalating complexity across datasets containing offensive language and misinformation. Encoder-only models exhibit performance drops of 14% and 21% in offensive language detection and misinformation detection, respectively; decoder-only models register a 16% decrease in both tasks; and encoder-decoder models show maximum performance drops of 14% and 26% in the respective tasks. The resilience-enhancement phase employs adversarial training that integrates pre-camouflaged and dynamically altered data. This approach reduces the performance drop of encoder-only models to an average of 5% in offensive language detection and 2% in misinformation detection. Decoder-only models, which occasionally exceed their original performance, limit the drop to 7% and 2% in the respective tasks. Encoder-decoder models, although they do not surpass their original performance, reduce the drop to averages of 6% and 2%, respectively. The results suggest a trade-off between performance and robustness, with some models maintaining comparable performance while gaining robustness. Our study and adversarial training techniques have been incorporated into an open-source tool for generating camouflaged datasets. However, the effectiveness of the methodology depends on the specific camouflage technique and the data encountered, underscoring the need for continued exploration.
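To make the data-construction idea concrete, the following is a minimal, illustrative sketch of how pre-camouflaged samples could be mixed into a training set, with dynamic alteration applied on the fly instead. The substitution map, function names, and augmentation ratio are hypothetical placeholders for illustration only; they are not the actual techniques or API of the open-source tool described in this study.

```python
import random

# Hypothetical character-substitution map used to "camouflage" text
# (leetspeak-style evasion); the study's actual camouflage techniques may differ.
SUBSTITUTIONS = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}


def camouflage(text, rate=0.3, rng=None):
    """Replace a fraction of substitutable characters to simulate an evasion attack."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        key = ch.lower()
        if key in SUBSTITUTIONS and rng.random() < rate:
            out.append(SUBSTITUTIONS[key])
        else:
            out.append(ch)
    return "".join(out)


def build_adversarial_training_set(samples, camouflaged_fraction=0.5, seed=0):
    """Mix clean samples with pre-camouflaged copies.

    Dynamic alteration would instead call camouflage() on each batch inside
    the training loop, so perturbations change from epoch to epoch.
    """
    rng = random.Random(seed)
    augmented = list(samples)
    for text, label in samples:
        if rng.random() < camouflaged_fraction:
            augmented.append((camouflage(text, rng=rng), label))
    return augmented


if __name__ == "__main__":
    data = [("this post is offensive", 1), ("a neutral statement", 0)]
    for label, text in ((lbl, txt) for txt, lbl in build_adversarial_training_set(data)):
        print(label, text)
```

In this sketch, the augmented set produced by `build_adversarial_training_set` would simply be fed to whichever fine-tuning pipeline the Transformer model already uses; the adversarial-training effect comes from exposing the model to camouflaged variants alongside the clean originals.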