Large language models (LLMs) can be abused at scale to create non-factual content and spread disinformation. Detecting LLM-generated content is essential to mitigate these risks, but current classifiers often fail to generalize in open-world contexts. Prior work shows that LLMs tend to make fewer edits when rewriting LLM-generated content, a signal that can be used for detection and that naturally generalizes to unforeseen data. However, we find that the rewriting edit distance between human and LLM content can be indistinguishable across domains, leading to detection failures. We propose training an LLM to rewrite input text, producing minimal edits for LLM-generated content and more edits for human-written text, yielding a distinguishable and generalizable edit-distance gap across different domains. Experiments on text from 21 independent domains and three popular LLMs (GPT-4o, Gemini, and Llama-3) show that our classifier outperforms the state-of-the-art zero-shot classifier by up to 20.6% in AUROC and the prior rewriting classifier by 9.2% in F1 score. Our work suggests that LLMs can effectively detect machine-generated text when properly trained.
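To make the detection criterion concrete, the following is a minimal sketch of rewrite-distance classification, not the paper's implementation. It assumes a hypothetical `rewrite` callable standing in for the trained rewriting LLM, approximates normalized edit distance with Python's standard-library `difflib`, and uses an illustrative threshold rather than a value from the paper.

```python
# Minimal sketch of rewrite-based detection. `rewrite(text)` is a
# hypothetical call to the fine-tuned rewriting LLM; the threshold
# below is illustrative, not a value reported in the paper.
from difflib import SequenceMatcher


def normalized_edit_distance(original: str, rewritten: str) -> float:
    """Approximate normalized edit distance via difflib's similarity ratio.

    SequenceMatcher.ratio() returns a similarity in [0, 1], so
    1 - ratio serves as a cheap stand-in for a normalized
    character-level Levenshtein distance.
    """
    return 1.0 - SequenceMatcher(None, original, rewritten).ratio()


def classify(text: str, rewrite, threshold: float = 0.2) -> str:
    """Label text as LLM-generated when the rewriter barely edits it.

    The trained rewriter is expected to make few edits to
    LLM-generated input and substantial edits to human-written
    input, so a small rewrite distance indicates machine-generated
    text.
    """
    distance = normalized_edit_distance(text, rewrite(text))
    return "llm-generated" if distance < threshold else "human-written"
```

Under these assumptions, a single rewriting pass plus a scalar threshold is the entire inference-time classifier, which is what lets the signal transfer to unseen domains without per-domain retraining.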