Large Language Models (LLMs) have demonstrated exceptional performance on a variety of tasks, including essay writing and question answering. However, it is crucial to address the potential misuse of these models, which can lead to detrimental outcomes such as plagiarism and spamming. Several detectors of LLM-generated text have recently been proposed, including fine-tuned classifiers and various statistical methods. In this study, we show that, with the aid of carefully crafted prompts, LLMs can effectively evade these detection systems. We propose a novel Substitution-based In-Context example Optimization method (SICO) to automatically generate such prompts. On three real-world tasks where LLMs can be misused, SICO enables ChatGPT to evade six existing detectors, causing an average AUC drop of 0.54. Surprisingly, in most cases these detectors perform even worse than random classifiers. These results expose the vulnerability of existing detectors. Finally, SICO's strong performance suggests that it can serve as a reliable evaluation protocol for any new detector in this field.
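To make the AUC figures above concrete, the following is a minimal sketch of how a detector's AUC can be computed and what "worse than random" means: a random classifier has an AUC of 0.5, so any value below 0.5 means the detector tends to rate machine-written text as more human than genuinely human text. The `auc` helper and all scores below are hypothetical and chosen only for illustration; they are not results or code from the paper.

```python
def auc(human_scores, machine_scores):
    """Pairwise AUC: probability that a randomly chosen machine-written text
    receives a higher detector score than a randomly chosen human-written
    text. Ties count as half a win."""
    wins = 0.0
    for m in machine_scores:
        for h in human_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(machine_scores) * len(human_scores))


# Hypothetical detector scores (higher = "more likely machine-written").
human = [0.1, 0.2, 0.3, 0.4]
machine_before = [0.6, 0.7, 0.8, 0.9]     # unmodified LLM output, easy to detect
machine_after = [0.05, 0.15, 0.25, 0.5]   # output produced with an evasion prompt

auc_before = auc(human, machine_before)
auc_after = auc(human, machine_after)

print(f"AUC before evasion: {auc_before:.4f}")
print(f"AUC after evasion:  {auc_after:.4f}")
print(f"AUC drop:           {auc_before - auc_after:.4f}")
print(f"Worse than random:  {auc_after < 0.5}")
```

In this toy setting the detector separates the two classes perfectly before evasion and falls below the 0.5 random baseline afterwards, which is the kind of failure the abstract describes.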