Detecting LLM-Generated Text in Computing Education: A Comparative Study for ChatGPT Cases

Recent improvements in Large Language Models (LLMs), together with their wide availability, pose a serious threat to academic integrity in education. Modern LLM-generated text detectors attempt to combat the problem by offering educators services that assess whether a given text is LLM-generated. In this work, we collected 124 submissions written by computer science students before the creation of ChatGPT. We then generated 40 ChatGPT submissions. We used this data to evaluate eight publicly available LLM-generated text detectors on accuracy, false positives, and resilience. The purpose of this work is to inform the community which LLM-generated text detectors work and which do not, and to provide insights that help educators better maintain academic integrity in their courses. Our results show that CopyLeaks is the most accurate LLM-generated text detector, GPTKit is the best at reducing false positives, and GLTR is the most resilient. We also express concern over the 52 false positives (of 114 human-written submissions) generated by GPTZero. Finally, we note that all LLM-generated text detectors are less accurate on code, on languages other than English, and on text that has passed through a paraphrasing tool (such as QuillBot). Modern detectors still need improvement before they can offer a foolproof solution for maintaining academic integrity. Further, their usability can be improved by facilitating smooth API integration, providing clear documentation of their features, making their model(s) easier to understand, and supporting more commonly used languages.
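As a minimal sketch of how two of the measures named above (accuracy and false positives) could be computed from a detector's verdicts, consider the following. The function name `evaluate_detector`, the boolean encoding (True means "LLM-generated"), and the `DetectorMetrics` container are assumptions made for illustration; the abstract does not describe the authors' tooling. Only the figures 52 and 114 come from the text.

```python
from typing import NamedTuple, Sequence


class DetectorMetrics(NamedTuple):
    """Hypothetical container for the two measures sketched here."""
    accuracy: float
    false_positive_rate: float


def evaluate_detector(is_llm: Sequence[bool], flagged: Sequence[bool]) -> DetectorMetrics:
    """Compare ground truth with detector verdicts.

    is_llm[i]  -- True if submission i was actually LLM-generated.
    flagged[i] -- True if the detector labeled submission i as LLM-generated.
    """
    if len(is_llm) != len(flagged):
        raise ValueError("label and prediction sequences must have equal length")
    if not is_llm:
        raise ValueError("at least one submission is required")

    # Accuracy: share of submissions whose verdict matches the ground truth.
    correct = sum(truth == guess for truth, guess in zip(is_llm, flagged))

    # False positive: a human-written submission flagged as LLM-generated.
    human_total = sum(not truth for truth in is_llm)
    false_positives = sum((not truth) and guess for truth, guess in zip(is_llm, flagged))
    fpr = false_positives / human_total if human_total else 0.0

    return DetectorMetrics(accuracy=correct / len(is_llm), false_positive_rate=fpr)


# GPTZero figure from the abstract: 52 of 114 human-written submissions flagged.
human_labels = [False] * 114
gptzero_flags = [True] * 52 + [False] * 62
metrics = evaluate_detector(human_labels, gptzero_flags)
print(f"false positive rate: {metrics.false_positive_rate:.3f}")  # prints 0.456
```

Read this way, the GPTZero result corresponds to a false positive rate of roughly 46% on the human-written submissions, which is why it is singled out as a concern.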
