The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework, known as LLM-as-a-judge, is highly scalable and relatively low-cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach: we use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these preamble generators attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks, which intervene directly on the model's response, our method is virtually undetectable. We also demonstrate that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that human preferences can be reverse-engineered effectively by pipelining LLMs to optimise upstream preambles via reinforcement learning, an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.
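To make the pipeline concrete, the following is a minimal sketch of the training loop the abstract describes, under simplifying assumptions not taken from the paper: the preamble generator is reduced to a softmax policy over a small fixed pool of candidate preambles and trained with REINFORCE, while `candidate_llm` and `judge_llm` are hypothetical stubs standing in for the frozen candidate-LLM and the judge-LLM. The real system tunes a full preamble-generator LLM against the judge's score.

```python
"""Sketch: adversarially tuning a preamble policy with a judge-LLM reward.

Assumptions (illustrative only): a softmax policy over a fixed preamble
pool, REINFORCE updates, and stubbed candidate/judge models.
"""
import math
import random

PREAMBLE_POOL = [
    "Answer concisely and cite your reasoning.",
    "You are a world-class expert; answer with confidence.",
    "Begin with a short summary, then elaborate step by step.",
]
logits = [0.0] * len(PREAMBLE_POOL)  # policy parameters
LR = 0.1

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def candidate_llm(preamble, question):
    # Stub: the frozen candidate-LLM answers conditioned on the preamble.
    return f"{preamble} [answer to: {question}]"

def judge_llm(question, answer):
    # Stub: the judge-LLM returns a scalar preference score in [0, 1].
    return random.random()

def reinforce_step(question, baseline=0.5):
    probs = softmax(logits)
    i = random.choices(range(len(PREAMBLE_POOL)), weights=probs)[0]
    answer = candidate_llm(PREAMBLE_POOL[i], question)
    reward = judge_llm(question, answer)  # judge score used as the RL reward
    # REINFORCE: grad of log pi(i) w.r.t. logits is one_hot(i) - probs
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += LR * (reward - baseline) * grad
    return reward

for step in range(100):
    reinforce_step("What causes tides?")
print("learned preamble policy:", softmax(logits))
```

Note that only the preamble policy is updated; the candidate-LLM stays frozen, which is what makes the intervention hard to detect in the candidate's responses.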