In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such "expert" annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated, even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts' free-form explanations shows that while they rely heavily on specific lexical clues ("AI vocabulary"), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging for automatic detectors to assess. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.