Despite their strong performance, current NLP models can be brittle under adversarial attacks. To enable effective learning against adversarial inputs, we introduce rationale models that explicitly learn to ignore attack tokens. We find that these rationale models successfully ignore over 90% of attack tokens. This approach yields consistent, sizable improvements in robustness ($\sim$10%) over baseline models on three datasets for both BERT and RoBERTa, and it reliably outperforms data augmentation with adversarial examples alone. In many cases, our method closes the gap between model performance on a clean test set and on an attacked test set, thereby reducing the effect of adversarial attacks.
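To make the "fraction of attack tokens ignored" figure concrete, the following is a minimal sketch of how such a quantity could be computed from a binary rationale mask. It is not the authors' implementation: the function names `attack_tokens_ignored` and `apply_rationale`, the 0/1 mask convention (1 = keep, 0 = ignore), and the use of a `[MASK]` placeholder for dropped tokens are all assumptions made for illustration.

```python
from typing import List


def attack_tokens_ignored(rationale_mask: List[int], is_attack: List[int]) -> float:
    """Return the fraction of attack tokens that the rationale mask drops.

    rationale_mask[i] == 1 means token i is kept, 0 means it is ignored
    (an assumed convention). is_attack[i] == 1 marks an inserted attack token.
    """
    if len(rationale_mask) != len(is_attack):
        raise ValueError("mask and attack labels must have the same length")
    attack_positions = [i for i, flag in enumerate(is_attack) if flag]
    if not attack_positions:
        raise ValueError("no attack tokens in this example")
    ignored = sum(1 for i in attack_positions if rationale_mask[i] == 0)
    return ignored / len(attack_positions)


def apply_rationale(tokens: List[str], rationale_mask: List[int],
                    mask_token: str = "[MASK]") -> List[str]:
    """Replace every token the rationale ignores with a placeholder token."""
    return [tok if keep else mask_token for tok, keep in zip(tokens, rationale_mask)]


# Hypothetical example: two attack tokens ("zx", "qq") inserted into a review.
tokens = ["the", "movie", "zx", "was", "qq", "great"]
is_attack = [0, 0, 1, 0, 1, 0]
rationale_mask = [1, 1, 0, 1, 1, 1]  # the rationale drops "zx" but keeps "qq"

ratio = attack_tokens_ignored(rationale_mask, is_attack)
filtered = apply_rationale(tokens, rationale_mask)
```

In this toy case the rationale ignores one of the two attack tokens, so `ratio` is 0.5; a downstream classifier would then see `filtered` in place of the raw input.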