Recent advances in large language models (LLMs) and the growing popularity of ChatGPT-like applications have blurred the boundary between human and machine generation of high-quality text. Alongside the anticipated revolutionary changes to technology and society, the difficulty of distinguishing LLM-generated texts (AI-text) from human-generated texts poses new challenges of misuse and fairness, such as fake content generation, plagiarism, and false accusations against innocent writers. Existing work shows that current AI-text detectors are not robust to LLM-based paraphrasing. This paper aims to bridge this gap by proposing a new framework called RADAR, which jointly trains a paraphraser and a robust AI-text detector via adversarial learning. The paraphraser's goal is to generate realistic content that evades AI-text detection. RADAR uses the feedback from the detector to update the paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, RADAR significantly outperforms existing AI-text detection methods, especially when paraphrasing is in place. We also identify strong transferability of RADAR from instruction-tuned LLMs to other LLMs, and evaluate the improved capability of RADAR via GPT-3.5-Turbo.
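The alternating update described above (detector feedback updates the paraphraser, and the paraphraser's outputs update the detector) can be illustrated with a deliberately tiny toy. This is a sketch under strong assumptions, not the objective or architecture used in this paper: each text is reduced to one scalar feature, the detector is a logistic model, and the "paraphraser" is a single learnable `shift` capped by a hypothetical `max_shift` that stands in for the need to preserve content. All names (`train_toy`, `HUMAN`, `AI`, `max_shift`) are illustrative.

```python
import math

# Toy scalar "features": human texts cluster near 0, AI texts near 2 (assumed data).
HUMAN = [-0.2, 0.0, 0.1, 0.3]
AI = [1.8, 2.0, 2.2, 2.5]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_toy(human, ai, rounds=200, lr=0.1, max_shift=1.0):
    """Alternate paraphraser and detector updates on scalar features.

    Detector:    P(AI | x) = sigmoid(w * x + b)
    Paraphraser: x -> x - shift, with 0 <= shift <= max_shift
    """
    w, b = 0.0, 0.0
    shift = 0.0
    for _ in range(rounds):
        # Paraphraser step: use the detector's output as feedback and move
        # `shift` in the direction that lowers the mean AI-probability
        # assigned to paraphrased samples.
        probs = [sigmoid(w * (x - shift) + b) for x in ai]
        grad_shift = sum(-w * p * (1.0 - p) for p in probs) / len(probs)
        shift = min(max_shift, max(0.0, shift - lr * grad_shift))

        # Detector step: logistic-loss gradient on human text (label 0),
        # original AI text (label 1), and paraphrased AI text (label 1).
        para = [x - shift for x in ai]
        data = ([(x, 0.0) for x in human]
                + [(x, 1.0) for x in ai]
                + [(x, 1.0) for x in para])
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / len(data)
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, shift


if __name__ == "__main__":
    w, b, shift = train_toy(HUMAN, AI)
    print("detector:", w, b, "paraphraser shift:", shift)
    for x in AI:
        print("P(AI | paraphrased) =", sigmoid(w * (x - shift) + b))
```

In this toy the detector keeps ranking paraphrased samples above human ones because the paraphraser's shift is capped. A real system would replace the scalar shift with an LLM paraphraser and the logistic model with a learned text classifier.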