In this paper, we show both empirically and theoretically that several AI-text detectors are not reliable in practical scenarios. Empirically, we show that paraphrasing attacks, in which a light paraphraser is applied to the output of a large language model (LLM), can break a whole range of detectors, including those using watermarking schemes, neural network-based detectors, and zero-shot classifiers. Our experiments demonstrate that retrieval-based detectors, which are designed to withstand paraphrasing attacks, are still vulnerable to recursive paraphrasing. We then provide a theoretical impossibility result indicating that as language models become more sophisticated and better at emulating human text, the performance of even the best-possible detector decreases. For a sufficiently advanced language model seeking to imitate human text, even the best-possible detector may perform only marginally better than a random classifier. Our result is general enough to capture specific scenarios such as particular writing styles, clever prompt design, or text paraphrasing. We also extend the impossibility result to the case where pseudorandom number generators, rather than true randomness, are used for AI-text generation, and show that the same result holds with a negligible correction term for all polynomial-time computable detectors. Finally, we show that even LLMs protected by watermarking schemes can be vulnerable to spoofing attacks, in which adversarial humans infer hidden LLM text signatures and add them to human-generated text so that it is detected as LLM-generated, potentially causing reputational damage to the LLMs' developers. We believe these results can open an honest conversation in the community regarding the ethical and reliable use of AI-generated text.
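The impossibility result can be made concrete with a small numeric sketch. It assumes the bound takes the form AUROC <= 1/2 + TV - TV^2/2, where TV is the total variation distance between the distributions of machine-generated and human-written text; the abstract itself does not state this formula, and the function name below is purely illustrative.

```python
def auroc_upper_bound(tv: float) -> float:
    """Upper bound on the AUROC of any detector, given the total variation
    distance `tv` between machine and human text distributions.

    Assumed form of the bound: 1/2 + tv - tv**2 / 2.
    """
    if not 0.0 <= tv <= 1.0:
        raise ValueError("total variation distance must lie in [0, 1]")
    return 0.5 + tv - tv * tv / 2.0


if __name__ == "__main__":
    # As the model imitates human text more closely, TV shrinks toward 0
    # and the bound approaches 0.5, the AUROC of a random classifier.
    for tv in (1.0, 0.5, 0.1, 0.0):
        print(f"TV = {tv:.1f}  ->  AUROC <= {auroc_upper_bound(tv):.3f}")
```

Under this assumed form, a total variation distance of 0.1 caps any detector's AUROC at 0.595, which illustrates the claim that the best-possible detector may perform only marginally better than random.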