ArguGPT: evaluating, understanding and identifying argumentative essays generated by GPT models

AI-generated content (AIGC) presents a considerable challenge to educators around the world. Instructors need to be able to detect text generated by large language models, either with the naked eye or with the help of tools. There is also a growing need to understand the lexical, syntactic and stylistic features of AIGC. To address these challenges in English language teaching, we first present ArguGPT, a balanced corpus of 4,038 argumentative essays generated by 7 GPT models in response to essay prompts from three sources: (1) in-class or homework exercises, (2) TOEFL and (3) GRE writing tasks. The machine-generated texts are paired with a roughly equal number of human-written essays that span three score levels and are matched by essay prompt. We then hire English instructors to distinguish machine essays from human ones. Results show that when first exposed to machine-generated essays, the instructors detect them with an accuracy of only 61%, but the figure rises to 67% after one round of minimal self-training. Next, we perform linguistic analyses of these essays, which show that machines produce sentences with more complex syntactic structures, while human essays tend to be lexically more complex. Finally, we test existing AIGC detectors and build our own detectors using SVMs and RoBERTa. Results suggest that a RoBERTa model fine-tuned on the ArguGPT training set achieves above 90% accuracy in both essay-level and sentence-level classification. To the best of our knowledge, this is the first comprehensive analysis of argumentative essays produced by generative large language models. Machine-authored essays in ArguGPT and our models will be made publicly available at https://github.com/huhailinguist/ArguGPT.
