The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging due to the vast number of LLMs, making it impractical to account for each LLM individually; hence, it is crucial to identify the common characteristics shared by these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate human-preferable texts. Our key finding is that because these aligned LLMs are trained to maximize human preference, they generate texts with even higher estimated preferences than human-written texts; thus, such texts are easily detected using a reward model (i.e., an LLM trained to model the human preference distribution). Based on this finding, we propose two training schemes to further improve the detection ability of the reward model, namely (i) continual preference fine-tuning to make the reward model prefer aligned LGTs even further and (ii) reward modeling of human/LLM mixed texts (human-written texts rephrased by aligned LLMs), which serve as an intermediate-preference text corpus between LGTs and human-written texts for learning the decision boundary better. We provide an extensive evaluation considering six text domains across twelve aligned LLMs, where our method demonstrates state-of-the-art results. Code is available at https://github.com/hyunseoklee-ai/ReMoDetect.
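The core detection rule described above can be sketched as follows: score a text with a reward model and flag it as LLM-generated when the estimated human-preference score exceeds a threshold. This is a minimal illustration only; `toy_reward` is a hypothetical placeholder for a learned reward model (the actual method additionally uses continual preference fine-tuning and mixed-text reward modeling, which are not shown here).

```python
def toy_reward(text: str) -> float:
    """Placeholder preference score. A real detector would replace this
    repetition-based heuristic with a learned reward model that estimates
    how human-preferable the text is."""
    words = text.split()
    if not words:
        return 0.0
    # Arbitrary stand-in: ratio of total words to distinct words.
    return len(words) / (1 + len(set(words)))


def detect_lgt(text: str, threshold: float) -> bool:
    """Classify a text as LLM-generated when its estimated preference
    score exceeds the chosen decision threshold."""
    return toy_reward(text) > threshold
```

In practice the threshold would be chosen on held-out human-written and LLM-generated texts (e.g., to maximize AUROC), exploiting the finding that aligned LLMs receive systematically higher reward scores than humans.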