This paper concerns corpus poisoning attacks on dense information retrieval, where an adversary attempts to compromise the ranking performance of a search algorithm by injecting a small number of maliciously generated documents into the corpus. Our work addresses two limitations of the current literature. First, existing attacks that perform adversarial gradient-based word substitution search do so in the discrete lexical space, while retrieval itself happens in the continuous embedding space. We therefore propose an optimization method that operates directly in the embedding space. Specifically, we train a perturbation model with the objective of keeping the adversarial document embedding geometrically close to the original, while maximizing the token-level dissimilarity between the original and adversarial documents. Second, related work commonly rests on the strong assumption that the adversary has prior knowledge of the queries. In this paper, we focus on a more challenging variant of the problem in which the adversary assumes no prior knowledge of the query distribution (hence, unsupervised). Our core contribution is an adversarial corpus attack that is both fast and effective. We present comprehensive experimental results on in- and out-of-domain datasets, focusing on two related tasks: a top-1 attack and a corpus poisoning attack, under both white-box and black-box settings. Notably, our method can generate a successful adversarial example in under two minutes per target document, four times faster than the fastest gradient-based word substitution methods in the literature on the same hardware. Furthermore, our adversarial generation method produces text that is more likely under the distribution of natural language (low perplexity), and is therefore harder to detect.
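The dual objective described above, staying close to the original document in embedding space while diverging from it at the token level, can be sketched as a simple combined loss. This is a minimal illustrative sketch, not the paper's actual formulation: the L2 distance, the token-overlap measure, and the weighting term `lam` are all assumptions introduced here for clarity.

```python
import numpy as np

def embedding_distance(e_orig, e_adv):
    # L2 distance between the original and adversarial document embeddings
    # (illustrative choice of metric; the paper's metric may differ).
    return float(np.linalg.norm(e_orig - e_adv))

def token_overlap(tokens_orig, tokens_adv):
    # Fraction of aligned positions where the adversarial token matches the
    # original; a stand-in for the paper's token-level similarity measure.
    matches = sum(a == b for a, b in zip(tokens_orig, tokens_adv))
    return matches / max(len(tokens_orig), 1)

def attack_objective(e_orig, e_adv, tokens_orig, tokens_adv, lam=1.0):
    # Lower is better: keep the adversarial embedding geometrically close to
    # the original while minimizing token-level overlap with the original text.
    # `lam` is a hypothetical trade-off weight between the two terms.
    return embedding_distance(e_orig, e_adv) + lam * token_overlap(tokens_orig, tokens_adv)
```

Under this sketch, an adversarial document whose embedding matches the original but whose tokens are entirely rewritten achieves a lower objective than a verbatim copy, which is the behavior the perturbation model is trained toward.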