With ChatGPT under the spotlight, the use of large language models (LLMs) to assist academic writing has drawn significant debate in the community. In this paper, we present a comprehensive study of the detectability of ChatGPT-generated content in academic literature, focusing on the abstracts of scientific papers, to support the future development of LLM applications and policies in academia. Specifically, we first present GPABench2, a benchmark dataset of over 2.8 million comparative samples of human-written, GPT-written, GPT-completed, and GPT-polished abstracts of scientific writing in computer science, physics, and the humanities and social sciences. Second, we explore methods for detecting ChatGPT content. We start by examining the unsatisfactory performance of existing ChatGPT detection tools and the challenges faced by human evaluators (including more than 240 researchers or students). We then test models based on hand-crafted linguistic features as a baseline and develop a deep neural framework named CheckGPT to better capture the subtle and deep semantic and linguistic patterns in ChatGPT-written literature. Last, we conduct comprehensive experiments to validate the proposed CheckGPT framework on each benchmarking task across disciplines. To further evaluate the detectability of ChatGPT content, we conduct extensive experiments on the transferability, prompt engineering, and robustness of CheckGPT.