The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework, known as LLM-as-a-judge, is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these preamble generators attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks that intervene directly on the model's response, our method is virtually undetectable. We also demonstrate that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that human preferences can be reverse engineered effectively by pipelining LLMs to optimise upstream preambles via reinforcement learning, an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.
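The pipeline described above (a preamble generator tuned by policy gradients, with the judge-LLM's score as reward and the candidate-LLM frozen) can be sketched minimally as follows. Everything here is a hypothetical stand-in: `PREAMBLES`, `judge_score`, and `candidate_llm` are toy substitutes for the real generator, judge-LLM, and candidate-LLM, and the update is plain REINFORCE over a small discrete preamble set rather than full fine-tuning of a generator model.

```python
import math
import random

# Toy preamble "action space"; a real generator would produce free-form text.
PREAMBLES = [
    "Answer briefly.",
    "As a careful expert, reason step by step and cite evidence.",
    "Respond informally.",
]

def judge_score(response: str) -> float:
    """Toy judge: rewards longer responses (stand-in for a judge-LLM)."""
    return min(len(response) / 100.0, 1.0)

def candidate_llm(preamble: str, question: str) -> str:
    """Toy frozen candidate: its output style follows the preamble
    (stand-in for a real, frozen candidate-LLM)."""
    answer = f"{preamble} The answer to '{question}' is 42."
    return answer * (2 if "expert" in preamble else 1)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(steps: int = 2000, lr: float = 0.5, seed: int = 0):
    """REINFORCE on a softmax policy over PREAMBLES, using the judge's
    score of the downstream response as the reward signal."""
    rng = random.Random(seed)
    logits = [0.0] * len(PREAMBLES)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(PREAMBLES)), weights=probs)[0]
        # Candidate-LLM stays frozen; only the preamble policy is updated.
        reward = judge_score(candidate_llm(PREAMBLES[i], "2+2"))
        baseline += 0.01 * (reward - baseline)  # moving-average baseline
        advantage = reward - baseline
        for j in range(len(logits)):  # policy-gradient update
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return logits

logits = train()
best = PREAMBLES[max(range(len(PREAMBLES)), key=lambda j: logits[j])]
print(best)
```

Under this toy judge, the policy concentrates on the preamble that elicits the highest-scoring downstream responses, mirroring how the tuned generator in the paper learns to exploit the judge's preferences without touching the candidate's weights.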