The evaluation and post-training of large language models (LLMs) rely on supervision, but strong supervision for difficult tasks is often unavailable, especially when evaluating frontier models. In such cases, models have been shown to exploit evaluations built on imperfect supervision, producing deceptive results. Meanwhile, a wealth of mechanism design research, largely underutilized in LLM work, focuses on game-theoretic incentive compatibility, i.e., eliciting honest and informative answers under weak supervision. Drawing on this literature, we introduce the peer prediction method for model evaluation and post-training. It rewards honest and informative answers over deceptive and uninformative ones, using a metric based on mutual predictability and without requiring ground-truth labels. We demonstrate the method's effectiveness and resistance to deception with both theoretical guarantees and empirical validation on models with up to 405B parameters. We show that training an 8B model with a peer prediction-based reward recovers most of the drop in truthfulness caused by prior malicious finetuning, even when the reward is produced by a 0.135B language model with no finetuning. On the evaluation front, in contrast to LLM-as-a-Judge, which requires strong and trusted judges, we discover an inverse scaling property in peer prediction: surprisingly, resistance to deception strengthens as the capability gap between the experts and participants widens, enabling reliable evaluation of strong models with weak supervision. In particular, LLM-as-a-Judge becomes worse than random guessing when facing deceptive models 5-20x the judge's size, whereas peer prediction thrives when such gaps are large, including in cases with over a 100x size difference.
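To make the mutual-predictability idea concrete, here is a minimal, hedged sketch of one common peer prediction score: reward an answer by how much it improves a weak reference model's prediction of an independent peer's answer (a pointwise-mutual-information-style signal). The toy probability tables and the function `peer_prediction_score` are illustrative assumptions, not the paper's exact metric or implementation; in practice the conditional probabilities would come from a small language model.

```python
import math

# Toy stand-ins for a weak reference LM's probabilities (hypothetical values).
# p(a_j | q): how likely the peer's answer is, given only the question.
P_PEER_GIVEN_Q = {"paris": 0.5, "london": 0.5}
# p(a_j | q, a_i): how likely the peer's answer is, given also the participant's answer.
P_PEER_GIVEN_Q_AND_ANSWER = {
    "paris":  {"paris": 0.9, "london": 0.1},  # honest answer makes the peer predictable
    "london": {"paris": 0.3, "london": 0.7},  # deceptive answer helps less
}

def peer_prediction_score(answer: str, peer_answer: str) -> float:
    """Score an answer by the log-probability gain it gives a weak model
    in predicting an independent peer's answer:
        score = log p(a_j | q, a_i) - log p(a_j | q)
    Positive scores indicate the answer is informative about the peer."""
    return (math.log(P_PEER_GIVEN_Q_AND_ANSWER[answer][peer_answer])
            - math.log(P_PEER_GIVEN_Q[peer_answer]))

# With an honest peer who answers "paris", honesty is rewarded:
honest_score = peer_prediction_score("paris", "paris")     # > 0
deceptive_score = peer_prediction_score("london", "paris") # < 0
```

No ground-truth label appears anywhere: the score depends only on the participant's answer, the peer's answer, and a (possibly much weaker) reference model's predictions, which is why the mechanism tolerates weak supervision.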