Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgments and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge {\em and} follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
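To make the Meta-Rewarding loop concrete, below is a minimal sketch of one iteration, not the paper's actual code: the function names (`generate`, `judge`, `meta_judge`) and the pair-construction details are illustrative assumptions. The same model plays three roles: it generates candidate responses, judges them to build response preference pairs, and meta-judges its own judgments to build judgment preference pairs; both pair sets would then feed a preference-optimization step such as DPO.

\begin{verbatim}
from typing import Callable, List, Tuple

# Hypothetical sketch of one Meta-Rewarding iteration (illustrative only).
# The caller supplies the model-dependent pieces:
#   generate(prompt) -> response
#   judge(prompt, response) -> (score, judgment_text)
#   meta_judge(prompt, response, judgment_a, judgment_b) -> 0 or 1 (preferred)
def meta_rewarding_iteration(
    prompts: List[str],
    generate: Callable[[str], str],
    judge: Callable[[str, str], Tuple[float, str]],
    meta_judge: Callable[[str, str, str, str], int],
    n_responses: int = 4,
    n_judgments: int = 2,
):
    actor_pairs: List[Tuple[str, str, str]] = []  # (prompt, chosen, rejected) responses
    judge_pairs: List[Tuple[str, str, str]] = []  # (judge prompt, chosen, rejected) judgments

    for prompt in prompts:
        # Actor step: sample several candidate responses from the current model.
        responses = [generate(prompt) for _ in range(n_responses)]

        # Judge step: the same model judges each response several times.
        scored = {r: [judge(prompt, r) for _ in range(n_judgments)] for r in responses}
        avg = {r: sum(s for s, _ in scored[r]) / n_judgments for r in responses}

        # Response preference pair: best vs. worst average self-assigned score.
        best = max(responses, key=lambda r: avg[r])
        worst = min(responses, key=lambda r: avg[r])
        if avg[best] > avg[worst]:
            actor_pairs.append((prompt, best, worst))

        # Meta-judge step: the model compares two of its own judgments of the
        # same response; the preferred one becomes "chosen" judge training data.
        for r in responses:
            (_, j0), (_, j1) = scored[r][0], scored[r][1]
            winner = meta_judge(prompt, r, j0, j1)
            chosen, rejected = (j0, j1) if winner == 0 else (j1, j0)
            judge_pairs.append((f"Judge this response to: {prompt}\n{r}", chosen, rejected))

    # Both pair sets would then be used for preference optimization (e.g., DPO).
    return actor_pairs, judge_pairs
\end{verbatim}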