The increasing prevalence of large language models (LLMs) has significantly advanced text generation, but the human-like quality of LLM outputs makes it difficult to reliably distinguish human-authored from LLM-generated texts. Existing detection benchmarks are constrained by their reliance on static datasets, scenario-specific tasks (e.g., question answering and text refinement), and a primary focus on English, overlooking the diverse linguistic and operational subtleties of LLMs. To address these gaps, we propose CUDRT, a comprehensive evaluation framework and bilingual (Chinese and English) benchmark that categorizes LLM activities into five key operations: Create, Update, Delete, Rewrite, and Translate. CUDRT provides extensive datasets tailored to each operation, featuring outputs from state-of-the-art LLMs, to assess the reliability of LLM-generated text detectors. The framework supports scalable, reproducible experiments and enables in-depth analysis of how operational diversity, multilingual training sets, and LLM architectures influence detection performance. Our extensive experiments demonstrate the framework's capacity to optimize detection systems, providing critical insights for improving reliability, cross-linguistic adaptability, and detection accuracy. By advancing robust methodologies for identifying LLM-generated texts, this work contributes to the development of intelligent systems capable of meeting real-world multilingual detection challenges. Source code and datasets are available on GitHub.