The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging due to the vast number of LLMs, making it impractical to account for each LLM individually; hence, it is crucial to identify the common characteristics shared by these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate human-preferable texts. Our key finding is that because these aligned LLMs are trained to maximize human preference, they generate texts with even higher estimated preferences than human-written texts; thus, such texts are easily detected using a reward model (i.e., an LLM trained to model the human preference distribution). Based on this finding, we propose two training schemes to further improve the detection ability of the reward model, namely (i) continual preference fine-tuning to make the reward model prefer aligned LGTs even further and (ii) reward modeling of human/LLM mixed texts (human-written texts rephrased by aligned LLMs), which serve as an intermediate-preference text corpus between LGTs and human-written texts for learning the decision boundary better. We provide an extensive evaluation considering six text domains across twelve aligned LLMs, where our method demonstrates state-of-the-art results. Code is available at https://github.com/hyunseoklee-ai/ReMoDetect.
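The core detection rule described above can be sketched as follows: score a text with a reward model and flag it as LLM-generated when the estimated human-preference score exceeds a threshold. This is a minimal illustration only; `toy_reward` is a hypothetical placeholder for a learned reward model (the actual method additionally uses continual preference fine-tuning and mixed-text reward modeling, which are not shown here).

```python
def toy_reward(text: str) -> float:
    """Placeholder preference score. A real detector would replace this
    repetition-based heuristic with a learned reward model that estimates
    how human-preferable the text is."""
    words = text.split()
    if not words:
        return 0.0
    # Arbitrary stand-in: ratio of total words to distinct words.
    return len(words) / (1 + len(set(words)))


def detect_lgt(text: str, threshold: float) -> bool:
    """Classify a text as LLM-generated when its estimated preference
    score exceeds the chosen decision threshold."""
    return toy_reward(text) > threshold
```

In practice the threshold would be chosen on held-out human-written and LLM-generated texts (e.g., to maximize AUROC), exploiting the finding that aligned LLMs receive systematically higher reward scores than humans.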