Backdoor learning has become an emerging research area towards building a trustworthy machine learning system. While a lot of works have studied the hidden danger of backdoor attacks in image or text classification, there is a limited understanding of the model's robustness on backdoor attacks when the output space is infinite and discrete. In this paper, we study a much more challenging problem of testing whether sequence-to-sequence (seq2seq) models are vulnerable to backdoor attacks. Specifically, we find by only injecting 0.2\% samples of the dataset, we can cause the seq2seq model to generate the designated keyword and even the whole sentence. Furthermore, we utilize Byte Pair Encoding (BPE) to create multiple new triggers, which brings new challenges to backdoor detection since these backdoors are not static. Extensive experiments on machine translation and text summarization have been conducted to show our proposed methods could achieve over 90\% attack success rate on multiple datasets and models.
翻译:后门学习已成为构建可信机器学习系统的新兴研究领域。尽管大量工作研究了图像或文本分类中后门攻击的潜在危险,但当输出空间为无限且离散时,模型对后门攻击的鲁棒性认知仍然有限。本文研究了一个更具挑战性的问题:测试序列到序列(seq2seq)模型是否易受后门攻击。具体而言,我们发现仅注入数据集的0.2%样本,就能使seq2seq模型生成指定关键词甚至整个句子。此外,我们利用字节对编码(BPE)创建多个新触发器,由于这些后门并非静态,这为后门检测带来了新挑战。通过在机器翻译和文本摘要上的大量实验表明,我们所提出的方法在多个数据集和模型上能达到超过90%的攻击成功率。