The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework, known as LLM-as-a-judge, is highly scalable and relatively low-cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach: we use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these preamble generators attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks, which intervene directly on the model's response, our method is virtually undetectable. We also demonstrate that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that human preferences can be reverse-engineered effectively by pipelining LLMs to optimise upstream preambles via reinforcement learning, an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.
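To make the pipeline concrete, the following is a minimal sketch of the training loop the abstract describes, under simplifying assumptions not taken from the paper: the preamble generator is reduced to a softmax policy over a small fixed pool of candidate preambles and trained with REINFORCE, while `candidate_llm` and `judge_llm` are hypothetical stubs standing in for the frozen candidate-LLM and the judge-LLM. The real system tunes a full preamble-generator LLM against the judge's score.

```python
"""Sketch: adversarially tuning a preamble policy with a judge-LLM reward.

Assumptions (illustrative only): a softmax policy over a fixed preamble
pool, REINFORCE updates, and stubbed candidate/judge models.
"""
import math
import random

PREAMBLE_POOL = [
    "Answer concisely and cite your reasoning.",
    "You are a world-class expert; answer with confidence.",
    "Begin with a short summary, then elaborate step by step.",
]
logits = [0.0] * len(PREAMBLE_POOL)  # policy parameters
LR = 0.1

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def candidate_llm(preamble, question):
    # Stub: the frozen candidate-LLM answers conditioned on the preamble.
    return f"{preamble} [answer to: {question}]"

def judge_llm(question, answer):
    # Stub: the judge-LLM returns a scalar preference score in [0, 1].
    return random.random()

def reinforce_step(question, baseline=0.5):
    probs = softmax(logits)
    i = random.choices(range(len(PREAMBLE_POOL)), weights=probs)[0]
    answer = candidate_llm(PREAMBLE_POOL[i], question)
    reward = judge_llm(question, answer)  # judge score used as the RL reward
    # REINFORCE: grad of log pi(i) w.r.t. logits is one_hot(i) - probs
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += LR * (reward - baseline) * grad
    return reward

for step in range(100):
    reinforce_step("What causes tides?")
print("learned preamble policy:", softmax(logits))
```

Note that only the preamble policy is updated; the candidate-LLM stays frozen, which is what makes the intervention hard to detect in the candidate's responses.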