To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to obtain high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), which fine-tunes pretrained language models on their own generated labels, \emph{without external supervision}. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden labels and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method elicits those capabilities significantly better than training on human labels does. Finally, we show that our method can improve the training of frontier LMs: we use it to train an unsupervised reward model, then train a Claude 4 Sonnet-based assistant against that reward model with reinforcement learning. The resulting assistant matches its counterpart trained on production-grade human labels on average, scoring higher on chat and safety but lower on math and coding.
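At a high level, the name suggests that ICM searches for a labeling of an unlabeled dataset that the pretrained model itself finds internally coherent, then fine-tunes on those labels. The toy Python sketch below illustrates one plausible form of that self-labeling idea: it scores a candidate labeling by how well each label is predicted from the others and improves the labeling with a simulated-annealing-style search. Everything here is an illustrative assumption rather than the paper's exact procedure: lm_logprob is a hypothetical stand-in for a pretrained-LM scoring call, and both the scoring function and the search strategy are sketches under those assumptions.

\begin{verbatim}
# Toy sketch of an ICM-style self-labeling loop (all names hypothetical).
import math
import random

def lm_logprob(example, label, context):
    """Stand-in for the pretrained LM's log P(label | example, context).
    A real implementation would query the model conditioned on the
    already-labeled `context`; this stub ignores `context` and returns
    a stable toy score so the sketch runs end to end."""
    rng = random.Random(hash((example, label)))
    return -rng.random()

def mutual_predictability(labels):
    """Score a full labeling by how well each label is predicted
    from all of the other labeled examples."""
    total = 0.0
    for x, y in labels.items():
        context = [(xi, yi) for xi, yi in labels.items() if xi != x]
        total += lm_logprob(x, y, context)
    return total

def icm_search(examples, label_set, steps=200, temp=1.0, decay=0.99):
    """Simulated-annealing-style search over labelings: propose a single
    label flip, always accept improvements, and occasionally accept
    regressions with a probability that shrinks as the temperature cools."""
    labels = {x: random.choice(label_set) for x in examples}
    score = mutual_predictability(labels)
    for _ in range(steps):
        proposal = dict(labels)
        proposal[random.choice(examples)] = random.choice(label_set)
        new_score = mutual_predictability(proposal)
        if new_score >= score or random.random() < math.exp((new_score - score) / temp):
            labels, score = proposal, new_score
        temp *= decay  # cool the acceptance temperature
    return labels  # self-generated labels, with no external supervision

if __name__ == "__main__":
    data = ["2 + 2 = 4", "2 + 2 = 5", "7 * 3 = 21"]
    print(icm_search(data, label_set=["True", "False"]))
\end{verbatim}

In the pipeline the abstract describes, labels found this way would then serve as fine-tuning targets, or as training data for an unsupervised reward model, in place of human annotations.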