The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework, known as LLM-as-a-judge, is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these preamble generators attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks that intervene directly on the model's response, our method is virtually undetectable. We also demonstrate that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that human preferences can be reverse engineered effectively by pipelining LLMs to optimise upstream preambles via reinforcement learning, an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.
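The pipeline described above (a preamble generator tuned by policy gradients, with the judge-LLM's score as reward and the candidate-LLM frozen) can be sketched minimally as follows. Everything here is a hypothetical stand-in: `PREAMBLES`, `judge_score`, and `candidate_llm` are toy substitutes for the real generator, judge-LLM, and candidate-LLM, and the update is plain REINFORCE over a small discrete preamble set rather than full fine-tuning of a generator model.

```python
import math
import random

# Toy preamble "action space"; a real generator would produce free-form text.
PREAMBLES = [
    "Answer briefly.",
    "As a careful expert, reason step by step and cite evidence.",
    "Respond informally.",
]

def judge_score(response: str) -> float:
    """Toy judge: rewards longer responses (stand-in for a judge-LLM)."""
    return min(len(response) / 100.0, 1.0)

def candidate_llm(preamble: str, question: str) -> str:
    """Toy frozen candidate: its output style follows the preamble
    (stand-in for a real, frozen candidate-LLM)."""
    answer = f"{preamble} The answer to '{question}' is 42."
    return answer * (2 if "expert" in preamble else 1)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(steps: int = 2000, lr: float = 0.5, seed: int = 0):
    """REINFORCE on a softmax policy over PREAMBLES, using the judge's
    score of the downstream response as the reward signal."""
    rng = random.Random(seed)
    logits = [0.0] * len(PREAMBLES)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(PREAMBLES)), weights=probs)[0]
        # Candidate-LLM stays frozen; only the preamble policy is updated.
        reward = judge_score(candidate_llm(PREAMBLES[i], "2+2"))
        baseline += 0.01 * (reward - baseline)  # moving-average baseline
        advantage = reward - baseline
        for j in range(len(logits)):  # policy-gradient update
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return logits

logits = train()
best = PREAMBLES[max(range(len(PREAMBLES)), key=lambda j: logits[j])]
print(best)
```

Under this toy judge, the policy concentrates on the preamble that elicits the highest-scoring downstream responses, mirroring how the tuned generator in the paper learns to exploit the judge's preferences without touching the candidate's weights.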