Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense

The rise in malicious usage of large language models, such as fake content creation and academic plagiarism, has motivated the development of approaches that identify AI-generated text, including those based on watermarking or outlier detection. However, the robustness of these detection algorithms to paraphrases of AI-generated text remains unclear. To stress test these detectors, we build an 11B-parameter paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering. Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking, GPTZero, DetectGPT, and OpenAI's text classifier. For example, DIPPER drops the detection accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. To increase the robustness of AI-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically similar generations and must be maintained by a language model API provider. Given a candidate text, our algorithm searches a database of sequences previously generated by the API, looking for sequences that match the candidate text within a certain threshold. We empirically verify our defense using a database of 15M generations from a fine-tuned T5-XXL model and find that it can detect 80% to 97% of paraphrased generations across different settings while only classifying 1% of human-written sequences as AI-generated. We open-source our models, code, and data.
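
The retrieval defense described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the released code: a bag-of-words cosine similarity stands in for a real semantic retriever (so it will miss paraphrases that replace most of the wording), an in-memory list stands in for the API provider's database, and the names `RetrievalDetector`, `record_generation`, `is_ai_generated`, and `threshold_at_fpr` are hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercased bag-of-words counts (assumed stand-in for a semantic encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class RetrievalDetector:
    """Hypothetical provider-side store of everything the API has generated."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold  # illustrative value, not from the paper
        self.store = []  # list of (text, vector) pairs

    def record_generation(self, text):
        """Called by the provider each time the model emits a sequence."""
        self.store.append((text, bow_vector(text)))

    def best_match(self, candidate):
        """Return (score, text) of the most similar stored generation."""
        query = bow_vector(candidate)
        best_score, best_text = 0.0, None
        for text, vec in self.store:
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_text = score, text
        return best_score, best_text

    def is_ai_generated(self, candidate):
        """Flag the candidate if any stored generation matches closely enough."""
        score, _ = self.best_match(candidate)
        return score >= self.threshold


def threshold_at_fpr(human_scores, target_fpr=0.01):
    """Pick a threshold that flags at most target_fpr of human-written texts.

    human_scores are best-match scores of known human texts against the store.
    Requires Python 3.9+ for math.nextafter.
    """
    if not human_scores:
        raise ValueError("need at least one human score")
    ranked = sorted(human_scores, reverse=True)
    allowed = int(len(ranked) * target_fpr)  # humans we may wrongly flag
    if allowed >= len(ranked):
        return 0.0
    # Sit just above the first score that must NOT be flagged.
    return math.nextafter(ranked[allowed], math.inf)
```

In use, the provider would call `record_generation` on every output, then answer detection queries with `is_ai_generated`. Calibrating the threshold on human-written text, as `threshold_at_fpr` does, mirrors the abstract's practice of reporting detection rates at a fixed 1% false positive rate.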

