Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgments and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge {\em and} follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
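The three roles described above — actor, judge, and meta-judge — can be sketched as one training-data-building iteration. This is an illustrative outline only, not the paper's implementation: the role functions (`generate`, `judge`, `meta_judge`), the fixed pair selection, and the numeric scoring are all hypothetical stand-ins; the actual method uses LLM judging prompts and trains both roles with DPO on the resulting preference pairs.

```python
def meta_rewarding_iteration(generate, judge, meta_judge, prompts, n_samples=4):
    """One illustrative Meta-Rewarding iteration (hypothetical sketch).

    generate(prompt) -> response text                (actor role)
    judge(prompt, response) -> (judgment_text, score) (judge role)
    meta_judge(prompt, judgment_a, judgment_b) -> 0 or 1, the index of the
        better judgment                               (meta-judge role)

    In the paper all three roles are the *same* model under different
    prompts. Returns preference pairs for training both roles.
    """
    actor_pairs, judge_pairs = [], []
    for prompt in prompts:
        # Actor role: sample several candidate responses.
        responses = [generate(prompt) for _ in range(n_samples)]

        # Judge role: the model scores its own responses.
        judgments = [judge(prompt, r) for r in responses]
        scores = [score for _, score in judgments]

        # Actor preference pair: highest- vs. lowest-scored response.
        best = responses[scores.index(max(scores))]
        worst = responses[scores.index(min(scores))]
        actor_pairs.append((prompt, best, worst))

        # Meta-judge role: compare two of the model's own judgments and
        # keep the better one as "chosen". (Fixed indices here for
        # simplicity; a real pipeline would compare many sampled pairs.)
        a, b = judgments[0][0], judgments[1][0]
        if meta_judge(prompt, a, b) == 0:
            chosen, rejected = a, b
        else:
            chosen, rejected = b, a
        judge_pairs.append((prompt, chosen, rejected))
    return actor_pairs, judge_pairs
```

Both pair lists would then feed a preference-optimization step (e.g. DPO), so that the judge improves alongside the actor instead of saturating.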