The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging because the vast number of LLMs makes it impractical to account for each one individually; hence, it is crucial to identify characteristics shared across these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate human-preferred texts. Our key finding is that, because these aligned LLMs are trained to maximize human preference, they generate texts with even higher estimated preference than human-written texts; thus, such texts are easily detected using a reward model (i.e., an LLM trained to model the human preference distribution). Based on this finding, we propose two training schemes to further improve the detection ability of the reward model: (i) continual preference fine-tuning, which makes the reward model prefer aligned LGTs even further, and (ii) reward modeling of Human/LLM mixed texts (human-written texts rephrased by aligned LLMs), which serve as an intermediate-preference corpus between LGTs and human-written texts and help the model learn the decision boundary better. We provide an extensive evaluation covering six text domains and twelve aligned LLMs, where our method demonstrates state-of-the-art results. Code is available at https://github.com/hyunseoklee-ai/reward_llm_detect.
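The detection idea described above can be illustrated with a minimal sketch: score each text with a reward model and flag it as LLM-generated when the score exceeds a threshold. This is not the released implementation. The names `auroc`, `is_llm_generated`, `toy_scores`, and `toy_reward` are hypothetical, and the scores are made-up placeholders standing in for a real reward model, not measured results.

```python
from typing import Callable, Sequence


def auroc(human_scores: Sequence[float], llm_scores: Sequence[float]) -> float:
    """Probability that a random LLM text outscores a random human text.

    Ties count as 0.5. A value near 1.0 means reward scores separate
    the two classes well; 0.5 means no separation.
    """
    if not human_scores or not llm_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for s_llm in llm_scores:
        for s_human in human_scores:
            if s_llm > s_human:
                wins += 1.0
            elif s_llm == s_human:
                wins += 0.5
    return wins / (len(llm_scores) * len(human_scores))


def is_llm_generated(
    text: str, reward_fn: Callable[[str], float], threshold: float
) -> bool:
    """Flag a text as LLM-generated when its reward score exceeds the threshold."""
    return reward_fn(text) > threshold


# Toy stand-in for a reward model: a lookup table of invented scores.
# A real detector would call a trained reward model here (assumption).
toy_scores = {
    "human essay": -0.4,
    "human review": 0.1,
    "llm essay": 1.3,
    "llm review": 0.9,
}
toy_reward = toy_scores.__getitem__
```

In practice the threshold would be chosen on held-out data, and `auroc` gives a threshold-free view of how well the reward scores separate the two classes.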