ReMoDetect: Reward Models Recognize Aligned LLM's Generations

The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging because the vast number of LLMs makes it impractical to account for each one individually; hence, it is crucial to identify the common characteristics shared by these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate human-preferable texts. Our key finding is that, because these aligned LLMs are trained to maximize human preferences, they generate texts with even higher estimated preferences than human-written texts; thus, such texts are easily detected by using a reward model (i.e., an LLM trained to model the human preference distribution). Based on this finding, we propose two training schemes to further improve the detection ability of the reward model, namely (i) continual preference fine-tuning to make the reward model prefer aligned LGTs even further and (ii) reward modeling of Human/LLM mixed texts (texts rephrased from human-written texts by aligned LLMs), which serve as a median-preference text corpus between LGTs and human-written texts to learn the decision boundary better. We provide an extensive evaluation by considering six text domains across twelve aligned LLMs, where our method demonstrates state-of-the-art results. Code is available at https://github.com/hyunseoklee-ai/reward_llm_detect.
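The detection idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: `reward_fn` is a hypothetical stand-in for a trained reward model, the scores are made-up toy values, and the Bradley-Terry style pairwise loss is an assumption, since the abstract does not state the exact training objective. The sketch shows (a) thresholding a reward score to flag LGTs, (b) measuring separability with AUROC, and (c) one possible way to encode the ordering LGT > mixed > human as pairwise preference terms.

```python
import math
from typing import Callable, List, Sequence


def detect(texts: Sequence[str], reward_fn: Callable[[str], float],
           threshold: float) -> List[bool]:
    """Flag a text as LLM-generated when its reward score exceeds the threshold."""
    return [reward_fn(t) > threshold for t in texts]


def auroc(lgt_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Probability that a random LGT outscores a random human text (ties count 0.5)."""
    wins = 0.0
    for s_lgt in lgt_scores:
        for s_hum in human_scores:
            if s_lgt > s_hum:
                wins += 1.0
            elif s_lgt == s_hum:
                wins += 0.5
    return wins / (len(lgt_scores) * len(human_scores))


def pairwise_preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Assumed Bradley-Terry style loss: -log sigmoid(r_preferred - r_rejected).

    Computed as a numerically stable softplus of the negative margin.
    """
    margin = r_preferred - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mixed_text_loss(r_lgt: float, r_mixed: float, r_human: float) -> float:
    """Hypothetical objective encoding LGT > mixed > human as two pairwise terms."""
    return (pairwise_preference_loss(r_lgt, r_mixed)
            + pairwise_preference_loss(r_mixed, r_human))


# Toy scores standing in for reward-model outputs (made-up values).
toy_scores = {"llm_a": 3.0, "llm_b": 2.0, "human_a": 1.0, "human_b": 2.5}


def reward_fn(text: str) -> float:
    """Hypothetical reward model: looks up a precomputed toy score."""
    return toy_scores[text]


flags = detect(["llm_a", "human_a"], reward_fn, threshold=1.5)
separability = auroc([3.0, 2.0], [1.0, 2.5])
```

In this toy setup a single threshold suffices because the scores are nearly separable; in practice the threshold would be chosen on held-out data, and the reward function would be a fine-tuned reward model rather than a lookup table.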

