Large language models (LLMs) have demonstrated exceptional performance on a variety of tasks, including essay writing and question answering. Their potential for misuse must be addressed, however, since it can lead to harmful outcomes such as plagiarism and spam. Several detectors have recently been proposed, including fine-tuned classifiers and various statistical methods. In this study, we show that, with carefully crafted prompts, LLMs can effectively evade these detection systems. We propose a novel Substitution-based In-Context example Optimization method (SICO) to automatically generate such prompts. On three real-world tasks where LLMs can be misused, SICO enables ChatGPT to evade six existing detectors, causing an average AUC drop of 0.54. Surprisingly, in most cases these detectors perform even worse than random classifiers. These results reveal the vulnerability of existing detectors. Finally, the strong performance of SICO suggests that it can serve as a reliable evaluation protocol for any new detector in this field.
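To make the AUC claim concrete, the following is a minimal sketch, not taken from the paper, of how a detector's AUC can fall below the 0.5 level of a random classifier. The function name `auc` and all score values are hypothetical and chosen only for illustration; they are not the paper's data or results. AUC is computed here as the probability that a randomly chosen AI-written text receives a higher detector score than a randomly chosen human-written text.

```python
def auc(human_scores, ai_scores):
    """Pairwise AUC: fraction of (AI, human) pairs in which the AI text
    gets the higher detector score; ties count as half a win."""
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))


# Hypothetical detector scores (higher means "more likely AI-written").
human = [0.2, 0.3, 0.4, 0.5]
ai_plain = [0.6, 0.7, 0.8, 0.9]        # unmodified AI text, well separated
ai_evasive = [0.1, 0.15, 0.25, 0.35]   # AI text written under an evasion prompt

auc_before = auc(human, ai_plain)
auc_after = auc(human, ai_evasive)

print(f"AUC before evasion: {auc_before:.4f}")
print(f"AUC after evasion:  {auc_after:.4f}")
print(f"AUC drop:           {auc_before - auc_after:.4f}")
```

An AUC of 0.5 corresponds to random guessing. In this toy setting the evasive texts mostly score lower than human texts, so the detector ranks them in the wrong order and its AUC lands below 0.5, which is the sense in which a detector can perform worse than a random classifier.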