With ChatGPT under the spotlight, the use of large language models (LLMs) for academic writing has drawn significant discussion and concern in the community. Although substantial research effort has gone into detecting LLM-generated content (LLM-content), most attempts are still at an early stage of exploration. In this paper, we present a holistic investigation of detecting LLM-generated academic writing by providing a dataset, evidence, and algorithms, in order to inspire more community effort to address the concern of LLM academic misuse. We first present GPABenchmark, a benchmarking dataset of 600,000 samples of human-written, GPT-written, GPT-completed, and GPT-polished abstracts of research papers in CS, physics, and humanities and social sciences (HSS). We show that existing open-source and commercial GPT detectors perform unsatisfactorily on GPABenchmark, especially for GPT-polished text. Moreover, through a user study of 150+ participants, we show that it is highly challenging for human users, including experienced faculty members and researchers, to identify GPT-generated abstracts. We then present CheckGPT, a novel LLM-content detector consisting of a general representation module and an attentive-BiLSTM classification module, which is accurate, transferable, and interpretable. Experimental results show that CheckGPT achieves an average classification accuracy of 98% to 99% for the task-specific, discipline-specific detectors and the unified detectors. CheckGPT is also highly transferable: without tuning, it achieves ~90% accuracy in new domains such as news articles, while a model tuned with approximately 2,000 samples in the target domain achieves ~98% accuracy. Finally, we demonstrate the explainability insights obtained from CheckGPT to reveal key behaviors of how LLMs generate text.