Language models can iteratively improve their outputs based on natural language feedback, thus enabling in-context optimization of user preference. In place of human users, a second language model can be used as an evaluator, providing feedback along with numerical ratings which the generator attempts to optimize. However, because the evaluator is an imperfect proxy for user preference, this optimization can lead to reward hacking, where the evaluator's ratings improve while the generation quality remains stagnant or even decreases as judged by actual user preference. The risk of reward hacking is heightened in iterative self-refinement where the generator and the evaluator use the same underlying language model, in which case the optimization pressure can drive them to exploit shared vulnerabilities. Using an essay editing task, we show that iterative self-refinement leads to divergence between the language model evaluator's ratings and human judgment, demonstrating that reward hacking can occur spontaneously in-context with the use of iterative self-refinement. In addition, we study the conditions under which reward hacking occurs and observe two factors that affect its severity: model size and context sharing between the generator and the evaluator.
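The generator-evaluator loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the setup used in the experiments: the function names (`self_refine`, `toy_generate`, `toy_evaluate`), the callable signatures, and the `share_context` flag are all hypothetical, and the toy functions stand in for calls to a language model. In particular, treating "context sharing" as the evaluator seeing the earlier feedback history is only one possible reading.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures; the abstract does not specify an API.
Generator = Callable[[str, List[str]], str]                 # (essay, feedback history) -> revised essay
Evaluator = Callable[[str, List[str]], Tuple[str, float]]   # (essay, context) -> (feedback, rating)


def self_refine(
    essay: str,
    generate: Generator,
    evaluate: Evaluator,
    rounds: int = 3,
    share_context: bool = True,
) -> Tuple[str, List[float]]:
    """Run `rounds` of evaluate-then-revise and return the final essay and ratings."""
    history: List[str] = []   # feedback accumulated for the generator
    ratings: List[float] = []
    for _ in range(rounds):
        # Assumption: with shared context the evaluator sees the earlier feedback;
        # otherwise it judges the current essay with no history.
        eval_context = list(history) if share_context else []
        feedback, rating = evaluate(essay, eval_context)
        ratings.append(rating)
        history.append(feedback)
        essay = generate(essay, history)
    return essay, ratings


def toy_generate(essay: str, history: List[str]) -> str:
    # Placeholder for a model call: appends a marker instead of really editing.
    return essay + " [revised]"


def toy_evaluate(essay: str, context: List[str]) -> Tuple[str, float]:
    # Placeholder proxy: rewards the marker, not actual essay quality,
    # so its rating can rise without any real improvement.
    rating = min(10.0, 5.0 + essay.count("[revised]"))
    return "Tighten the argument.", rating


final_essay, ratings = self_refine("Draft.", toy_generate, toy_evaluate, rounds=3)
print(ratings)
```

In this toy run the evaluator's ratings increase each round even though the essay has not improved in any meaningful sense, which is the kind of gap that would have to be checked against human judgment.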