Large Language Models (LLMs) have demonstrated exceptional performance on a variety of tasks, including essay writing and question answering. However, their potential misuse, which can lead to harmful outcomes such as plagiarism and spamming, must be addressed. Several detectors have recently been proposed, including fine-tuned classifiers and various statistical methods. In this study, we show that, with the aid of carefully crafted prompts, LLMs can effectively evade these detection systems. We propose a novel Substitution-based In-Context example Optimization method (SICO) to automatically generate such prompts. On three real-world tasks where LLMs can be misused, SICO enables ChatGPT to evade six existing detectors, causing an average AUC drop of 0.54. Surprisingly, in most cases these detectors perform even worse than random classifiers. These results reveal the vulnerability of existing detectors. Finally, the strong performance of SICO suggests that it can serve as a reliable evaluation protocol for any new detector in this field.
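To make the AUC figures concrete, the following minimal sketch computes AUC as the probability that a detector scores a randomly chosen AI-written text higher than a randomly chosen human-written one. The function name `auc` and all detector scores are hypothetical and chosen only for illustration; they are not results from this work. A value of 0.5 corresponds to a random classifier, so a value below 0.5 means the detector ranks texts worse than chance.

```python
def auc(scores_ai, scores_human):
    """AUC as a pairwise ranking probability (ties count as half a win).

    Each score is a hypothetical detector output, where a higher value
    means "more likely AI-written".
    """
    wins = 0.0
    for a in scores_ai:
        for h in scores_human:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(scores_ai) * len(scores_human))


# Made-up detector scores, for illustration only.
human = [0.10, 0.20, 0.30, 0.40]
ai_plain = [0.60, 0.70, 0.80, 0.90]    # ordinary LLM output, easy to flag
ai_evasive = [0.05, 0.15, 0.25, 0.35]  # output generated with an evasion prompt

auc_before = auc(ai_plain, human)
auc_after = auc(ai_evasive, human)

print(f"AUC before evasion: {auc_before}")
print(f"AUC after evasion:  {auc_after}")
print(f"AUC drop:           {auc_before - auc_after}")
```

In this toy setting the detector separates the two classes perfectly before evasion and falls below the 0.5 chance level afterwards, which is the sense in which a detector can perform worse than a random classifier.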