CUDRT: Benchmarking the Detection Models of Human vs. Large Language Models Generated Texts

While large language models (LLMs) have greatly enhanced text generation across industries, their human-like outputs make it challenging to distinguish human from AI authorship. Although many LLM-generated text detectors exist, current benchmarks mainly rely on static datasets, which limits their effectiveness in assessing model-based detectors that require prior training. Furthermore, these benchmarks focus on specific scenarios such as question answering and text refinement and are primarily limited to English, overlooking broader linguistic applications and LLM subtleties. To address these gaps, we construct a comprehensive bilingual benchmark in Chinese and English to rigorously evaluate mainstream LLM-generated text detection methods. We categorize LLM text generation into five key operations: Create, Update, Delete, Rewrite, and Translate (CUDRT), which together cover the full range of LLM activities. For each CUDRT category, we develop extensive datasets that enable thorough assessment of detection performance, incorporating the latest mainstream LLMs for each language. We also establish a robust evaluation framework to support scalable, reproducible experiments, facilitating an in-depth analysis of how LLM operations, different LLMs, datasets, and multilingual training sets affect detector performance, particularly for model-based methods. Our extensive experiments provide critical insights for optimizing LLM-generated text detectors and suggest future directions for improving detection accuracy and generalization across diverse scenarios. The source code and datasets are available on GitHub.
