While large language models (LLMs) have greatly enhanced text generation across industries, their human-like outputs make it difficult to distinguish human from AI authorship. Although many LLM-generated text detectors exist, current benchmarks rely mainly on static datasets, which limits their usefulness for assessing model-based detectors that require prior training. Furthermore, these benchmarks focus on specific scenarios such as question answering and text refinement and are largely limited to English, overlooking broader linguistic applications and LLM subtleties. To address these gaps, we construct a comprehensive bilingual benchmark in Chinese and English to rigorously evaluate mainstream LLM-generated text detection methods. We categorize LLM text generation into five key operations: Create, Update, Delete, Rewrite, and Translate (CUDRT), which together cover the full range of LLM activities. For each CUDRT category, we develop extensive datasets that enable thorough assessment of detection performance, incorporating the latest mainstream LLMs for each language. We also establish a robust evaluation framework to support scalable, reproducible experiments, facilitating an in-depth analysis of how LLM operations, different LLMs, datasets, and multilingual training sets affect detector performance, particularly for model-based methods. Our extensive experiments provide critical insights for optimizing LLM-generated text detectors and suggest future directions for improving detection accuracy and generalization across diverse scenarios. Source code and dataset are available on GitHub.
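To make the idea of analyzing detector performance per CUDRT operation concrete, the following is a minimal sketch of how predictions might be grouped by operation and scored. It is not the benchmark's actual evaluation code: the function name `accuracy_by_operation`, the record layout, and the label convention (1 for LLM-generated, 0 for human-written) are assumptions made purely for illustration.

```python
from collections import defaultdict

# Operation names taken from the CUDRT acronym; the tuple itself is a
# hypothetical helper, not part of the released code.
OPERATIONS = ("Create", "Update", "Delete", "Rewrite", "Translate")


def accuracy_by_operation(records):
    """Compute detector accuracy separately for each CUDRT operation.

    `records` is an iterable of (operation, true_label, predicted_label)
    tuples, where labels are assumed to be 1 for LLM-generated text and
    0 for human-written text.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for operation, truth, predicted in records:
        if operation not in OPERATIONS:
            raise ValueError(f"unknown operation: {operation}")
        total[operation] += 1
        correct[operation] += int(truth == predicted)
    # Report only operations that actually appear in the records.
    return {op: correct[op] / total[op] for op in OPERATIONS if total[op]}


# Toy, made-up predictions used only to show the call shape.
example = [
    ("Create", 1, 1),
    ("Create", 0, 1),
    ("Translate", 1, 1),
]
print(accuracy_by_operation(example))
```

Reporting one score per operation, rather than a single pooled score, is what lets differences between operations (for example, Create versus Translate) show up in the results.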