Pre-trained transformers are widely used in state-of-the-art dialogue generation (DG) systems. Such language models are, however, vulnerable to adversarial samples, as studied in traditional tasks such as text classification, which motivates us to examine their robustness in DG systems. One main challenge of attacking DG models is that perturbations on the current sentence can hardly degrade response accuracy, because the unchanged chat history is also considered in decision-making. Instead of merely pursuing degradation of performance metrics such as BLEU and ROUGE, we observe that crafting adversarial samples to force longer generation outputs benefits attack effectiveness: the generated responses are typically irrelevant, lengthy, and repetitive. To this end, we propose a white-box multi-objective attack method called DGSlow. Specifically, DGSlow balances two objectives, generation accuracy and length, via a gradient-based multi-objective optimizer, and applies an adaptive searching mechanism to iteratively craft adversarial samples with only a few modifications. Comprehensive experiments on four benchmark datasets demonstrate that DGSlow can significantly degrade state-of-the-art DG models with a higher success rate than traditional accuracy-based methods. In addition, our crafted sentences exhibit strong transferability in attacking other models.
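The abstract does not say which gradient-based multi-objective optimizer DGSlow uses, so the following is only an illustrative sketch of one common way to balance two objectives: the closed-form min-norm weighting for two gradients (as used in MGDA-style methods). The function names `min_norm_weight` and `combine_gradients`, and the toy gradient vectors, are hypothetical and not taken from the paper.

```python
def min_norm_weight(g_acc, g_len):
    """Return alpha in [0, 1] minimizing ||alpha*g_acc + (1-alpha)*g_len||^2.

    g_acc and g_len are hypothetical gradients of an accuracy objective and
    a length objective, given as equal-length lists of floats.
    """
    diff = [a - b for a, b in zip(g_acc, g_len)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: any weighting gives the same direction.
        return 0.5
    # Setting the derivative to zero gives
    # alpha = ((g_len - g_acc) . g_len) / ||g_acc - g_len||^2.
    numer = sum((b - a) * b for a, b in zip(g_acc, g_len))
    # Clip so the result stays a convex combination.
    return min(1.0, max(0.0, numer / denom))


def combine_gradients(g_acc, g_len):
    """Blend the two gradients using the min-norm weight."""
    alpha = min_norm_weight(g_acc, g_len)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g_acc, g_len)]


if __name__ == "__main__":
    # Toy example: two orthogonal gradients receive equal weight.
    print(combine_gradients([1.0, 0.0], [0.0, 1.0]))
```

When the two gradients conflict, the min-norm combination yields a direction that does not worsen either objective to first order; when one gradient dominates, clipping falls back to the smaller one. An attack loop would use such a combined gradient to rank candidate word substitutions, but that step depends on details the abstract does not give.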