The proliferation of large language models has raised growing concerns about their misuse, particularly in cases where AI-generated text is falsely attributed to human authors. Machine-generated content detectors claim to effectively identify such text under various conditions and from any language model. This paper critically evaluates these claims by assessing several popular detectors (RADAR, Wild, T5Sentinel, Fast-DetectGPT, GPTID, LogRank, Binoculars) on a range of domains, datasets, and models that these detectors have not previously encountered. We employ various prompting strategies to simulate adversarial attacks, demonstrating that even moderate evasion effort can significantly degrade detection. We emphasize the importance of the true positive rate at a specific false positive rate (TPR@FPR) as an evaluation metric and demonstrate that these detectors perform poorly in certain settings, with TPR@.01 as low as 0\%. Our findings suggest that both trained and zero-shot detectors struggle to maintain high sensitivity while keeping the false positive rate at a reasonable level.
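To make the TPR@FPR metric concrete, the following is a minimal sketch (not code from the paper) of how TPR@.01 can be computed from detector scores. It assumes a score convention where higher values mean "more likely machine-generated"; the function name and the toy score distributions are illustrative.

```python
import numpy as np

def tpr_at_fpr(scores_human, scores_machine, target_fpr=0.01):
    """TPR@FPR: fraction of machine-generated texts flagged when the
    decision threshold is set so that at most `target_fpr` of human
    texts are falsely flagged. Higher score = 'more machine-like'."""
    scores_human = np.asarray(scores_human, dtype=float)
    scores_machine = np.asarray(scores_machine, dtype=float)
    # Threshold at the (1 - target_fpr) quantile of the human scores,
    # so only ~target_fpr of human texts score above it.
    threshold = np.quantile(scores_human, 1.0 - target_fpr)
    return float(np.mean(scores_machine > threshold))

# Toy example: heavily overlapping score distributions (a weak detector)
# yield a low TPR@.01 even when average scores differ.
rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, 10_000)
machine = rng.normal(0.5, 1.0, 10_000)
print(f"TPR@.01 = {tpr_at_fpr(human, machine):.3f}")
```

A TPR@.01 of 0\% then corresponds to no machine-generated text scoring above the threshold that keeps human false positives at 1\%.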