Recent advances in large language models (LLMs) and the growing popularity of ChatGPT-like applications have blurred the boundary between humans and machines in high-quality text generation. Beyond the anticipated revolutionary changes to technology and society, the difficulty of distinguishing LLM-generated text (AI-text) from human-written text poses new challenges of misuse and fairness, such as fake content generation, plagiarism, and false accusations against innocent writers. Existing work shows that current AI-text detectors are not robust to LLM-based paraphrasing. This paper aims to bridge this gap by proposing RADAR, a new framework that jointly trains a Robust AI-text Detector via Adversarial leaRning. RADAR is based on adversarial training of a paraphraser and a detector. The paraphraser's goal is to generate realistic content that evades AI-text detection. RADAR uses feedback from the detector to update the paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, RADAR significantly outperforms existing AI-text detection methods, especially when paraphrasing is in place. We also identify strong transferability of RADAR from instruction-tuned LLMs to other LLMs, and evaluate the improved capability of RADAR via GPT-3.5.
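The alternating feedback between paraphraser and detector can be illustrated with a toy sketch. It is not RADAR's implementation: the abstract does not state the update rules, so everything here is an assumption made for illustration. Texts are replaced by a single numeric feature, the detector is a 1-D logistic regression, and the paraphraser is a bounded additive shift updated by gradient ascent on the detector's "human" probability. All names (`Detector`, `Paraphraser`, `train_adversarially`, `MAX_SHIFT`) are hypothetical.

```python
import math

# Toy stand-ins for text: one numeric "style" feature per sample (assumed data).
HUMAN = [-0.2, 0.0, 0.2]
AI = [1.8, 2.0, 2.2]
MAX_SHIFT = 1.0  # assumed bound: a paraphrase may only move the feature this far


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class Detector:
    """1-D logistic regression: P(AI | x) = sigmoid(w * x + b)."""

    def __init__(self):
        self.w = 0.0
        self.b = 0.0

    def prob_ai(self, x):
        return sigmoid(self.w * x + self.b)

    def update(self, xs, labels, lr=0.5):
        # One gradient-descent step on the mean log-loss (label 1 = AI).
        gw = gb = 0.0
        for x, y in zip(xs, labels):
            err = self.prob_ai(x) - y
            gw += err * x
            gb += err
        self.w -= lr * gw / len(xs)
        self.b -= lr * gb / len(xs)


class Paraphraser:
    """Rewrites a sample by adding a learned, bounded offset (toy assumption)."""

    def __init__(self):
        self.shift = 0.0

    def rewrite(self, x):
        return x + self.shift

    def update(self, xs, detector, lr=0.5):
        # Gradient ascent on the mean P(human) = 1 - P(AI) of rewritten samples.
        g = 0.0
        for x in xs:
            p = detector.prob_ai(self.rewrite(x))
            g += -p * (1.0 - p) * detector.w
        self.shift += lr * g / len(xs)
        self.shift = max(-MAX_SHIFT, min(MAX_SHIFT, self.shift))


def train_adversarially(rounds=200):
    detector, paraphraser = Detector(), Paraphraser()
    for _ in range(rounds):
        paraphrased = [paraphraser.rewrite(x) for x in AI]
        # Detector step: human vs. AI-text and paraphrased AI-text.
        xs = HUMAN + AI + paraphrased
        labels = [0] * len(HUMAN) + [1] * (len(AI) + len(paraphrased))
        detector.update(xs, labels)
        # Paraphraser step: use the detector's output as feedback.
        paraphraser.update(AI, detector)
    return detector, paraphraser


if __name__ == "__main__":
    det, par = train_adversarially()
    print("learned shift:", par.shift)
    print("P(AI) human / paraphrased AI / original AI:",
          det.prob_ai(0.0), det.prob_ai(par.rewrite(2.0)), det.prob_ai(2.0))
```

In this toy the paraphraser pushes AI samples toward the human feature range, while the detector keeps seeing those rewritten samples labeled as AI. The bound `MAX_SHIFT` is a crude stand-in for the requirement that a paraphrase stay realistic.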