The widespread use of large language models (LLMs) has sparked concerns about the potential misuse of AI-generated text, as these models can produce content that closely resembles human-written text. Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations: even minor character- or word-level changes can flip a prediction between human-written and AI-generated. This paper investigates the robustness of existing AIGT detection methods and introduces a novel detector, the Siamese Calibrated Reconstruction Network (SCRN). The SCRN employs a reconstruction network to add and remove noise from text, extracting a semantic representation that is robust to local perturbations. We also propose a siamese calibration technique that trains the model to make equally confident predictions under different noise, which further improves robustness against adversarial perturbations. Experiments on four publicly available datasets show that the SCRN outperforms all baseline methods, achieving a 6.5\%-18.25\% absolute accuracy improvement over the best baseline under adversarial attacks. Moreover, it exhibits superior generalizability in cross-domain, cross-genre, and mixed-source scenarios. The code is available at \url{https://github.com/CarlanLark/Robust-AIGC-Detector}.
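To make the siamese calibration idea concrete, here is a minimal sketch (not the paper's exact loss): the model is run twice on the same input under two independent noise draws, and a symmetric KL divergence between the two predictive distributions penalizes confidence mismatches. The logits and the use of symmetric KL here are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def symmetric_kl(p, q, eps=1e-12):
    # Symmetric KL between two categorical distributions;
    # zero iff the two predictions are equally confident.
    p, q = p + eps, q + eps
    kl_pq = np.sum(p * np.log(p / q), axis=-1)
    kl_qp = np.sum(q * np.log(q / p), axis=-1)
    return 0.5 * (kl_pq + kl_qp)

# Hypothetical logits from two forward passes of the same text
# under different injected noise (binary: human vs. AI-generated).
logits_a = np.array([[2.0, 0.5]])
logits_b = np.array([[1.2, 1.1]])

# Calibration penalty added to the classification loss during training.
calib_loss = symmetric_kl(softmax(logits_a), softmax(logits_b)).mean()
```

Minimizing this penalty pushes the two noisy views toward equally confident predictions, so no single local perturbation can sway the detector's confidence.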