With the emergence of widely available, powerful large language models (LLMs), disinformation generated by LLMs has become a major concern. LLM detectors have been touted as a solution, but their effectiveness in the real world remains to be proven. In this paper, we focus on an important setting in information operations: short news-like posts generated by moderately sophisticated attackers. We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting. All tested zero-shot detectors perform inconsistently with prior benchmarks and are highly vulnerable to an increase in sampling temperature, a trivial attack absent from recent benchmarks. A purpose-trained detector that generalizes across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts. We argue that the former indicates the need for domain-specific benchmarking, while the latter suggests a trade-off between adversarial-evasion resilience and overfitting to the reference human text; both properties need to be evaluated by benchmarks but currently are not. We believe these findings call for a reconsideration of current LLM-detector benchmarking approaches, and we provide a dynamically extensible benchmark to support it (https://github.com/Reliable-Information-Lab-HEVS/benchmark_llm_texts_detection).