The rapid proliferation of large language models (LLMs) has increased the volume of machine-generated texts (MGTs) and blurred text authorship in various domains. However, most existing MGT benchmarks include only single-author texts, either human-written or machine-generated. This conventional design fails to capture more practical multi-author scenarios, where the user refines the LLM response for natural flow, coherence, and factual correctness. Our paper introduces the Benchmark of Expert-edited Machine-generated Outputs (Beemo), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by experts for various use cases, ranging from creative writing to summarization. Beemo additionally comprises 13.1k machine-generated and LLM-edited texts, allowing for diverse evaluation of MGT detection across various edit types. We document Beemo's creation protocol and present the results of benchmarking 33 configurations of MGT detectors in different experimental setups. We find that expert-based editing evades MGT detection, while LLM-edited texts are unlikely to be recognized as human-written. Beemo and all materials are publicly available.