Despite their promising performance across various natural language processing (NLP) tasks, current NLP systems are vulnerable to textual adversarial attacks. To defend against these attacks, most existing methods apply adversarial training by incorporating adversarial examples. However, these methods rely on ground-truth labels to generate adversarial examples, which makes them impractical for large-scale model pre-training, now commonly used for NLP and many other tasks. In this paper, we propose a novel learning framework called SCAT (Self-supervised Contrastive Learning via Adversarial Training), which can learn robust representations without requiring labeled data. Specifically, SCAT modifies random augmentations of the data in a fully label-free manner to generate adversarial examples. Adversarial training is achieved by minimizing the contrastive loss between the augmentations and their adversarial counterparts. We evaluate SCAT on two text classification datasets under two recently proposed state-of-the-art attack schemes. Our results show that SCAT not only trains robust language models from scratch, but also significantly improves the robustness of existing pre-trained language models. Moreover, to demonstrate its flexibility, we show that SCAT can be combined with supervised adversarial training to further enhance model robustness.
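The contrastive objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so the InfoNCE-style formulation below is an assumption, as are the function names (`cosine`, `contrastive_loss`), the `temperature` value, and the toy vectors. The sketch covers only the loss between augmentations and their adversarial counterparts, not how SCAT generates the adversarial examples.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(aug, adv, temperature=0.5):
    """InfoNCE-style loss (assumed form): aug[i] should match adv[i], not adv[j] for j != i."""
    total = 0.0
    for i, anchor in enumerate(aug):
        logits = [cosine(anchor, other) / temperature for other in adv]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true counterpart.
        total += log_denom - logits[i]
    return total / len(aug)

# Toy representations (made-up numbers) of two augmented texts.
aug = [[1.0, 0.0], [0.0, 1.0]]
adv_close = [[0.9, 0.1], [0.1, 0.9]]    # adversarial views stay near their sources
adv_swapped = [[0.1, 0.9], [0.9, 0.1]]  # adversarial views drift toward the other text

loss_close = contrastive_loss(aug, adv_close)
loss_swapped = contrastive_loss(aug, adv_swapped)
```

In this toy setting the loss is lower when each adversarial view stays close to the augmentation it was derived from, which is the behavior that minimizing such a loss would encourage in an encoder.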