The rise in malicious usage of large language models, such as fake content creation and academic plagiarism, has motivated the development of approaches that identify AI-generated text, including those based on watermarking or outlier detection. However, the robustness of these detection algorithms to paraphrases of AI-generated text remains unclear. To stress test these detectors, we build an 11B-parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering. Paraphrasing text generated by three large language models (including GPT3.5-davinci-003) with DIPPER successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings while classifying only 1% of human-written sequences as AI-generated. We open-source our models, code, and data.
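The retrieval defense described above can be illustrated with a minimal sketch. It is not the released implementation: the class name `GenerationStore`, the bag-of-words cosine similarity, and the threshold of 0.75 are all assumptions chosen for illustration. A real deployment would presumably swap in a semantic encoder or a lexical retriever, and would tune the threshold on human-written text to hit a target false positive rate.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts; a toy stand-in for a semantic encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class GenerationStore:
    """Hypothetical database of everything the API has generated so far."""

    def __init__(self, threshold=0.75):
        # Assumed value; in practice it would be calibrated on human text.
        self.threshold = threshold
        self.vectors = []

    def record(self, generation):
        """Called by the API provider each time the model emits text."""
        self.vectors.append(bow(generation))

    def best_score(self, candidate):
        """Highest similarity between the candidate and any stored generation."""
        query = bow(candidate)
        return max((cosine(query, v) for v in self.vectors), default=0.0)

    def is_ai_generated(self, candidate):
        """Flag the candidate if some stored generation matches closely enough."""
        return self.best_score(candidate) >= self.threshold


store = GenerationStore()
store.record("The quick brown fox jumps over the lazy dog near the river bank.")

reordered = "Near the river bank, the quick brown fox jumps over the lazy dog."
unrelated = "Stock markets fell sharply on Tuesday."
print(store.is_ai_generated(reordered))
print(store.is_ai_generated(unrelated))
```

The linear scan and word-count vectors keep the sketch self-contained. They only catch paraphrases that reuse most of the original vocabulary, so a paraphraser with high lexical diversity would need a genuinely semantic similarity function in place of `bow` and `cosine`.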