Verification of model outputs is rapidly emerging as a key primitive for both training and real-world deployment of large language models (LLMs). In practice, this often involves using imperfect LLM judges and reward models, since acquiring ground truth can be time-consuming and expensive. We introduce Fully Unsupervised Score Ensembling (FUSE), a method for improving verification quality by ensembling verifiers without access to ground truth correctness labels. The key idea behind FUSE is to control conditional dependencies between verifiers in a manner that improves the unsupervised performance of a class of spectral algorithms from the ensembling literature. Despite requiring no ground truth labels, FUSE typically matches or improves upon semi-supervised alternatives in test-time scaling experiments with diverse sets of generator models, verifiers, and benchmarks. In particular, we validate our method both on conventional academic benchmarks such as GPQA Diamond and on frontier, unsaturated benchmarks such as Humanity's Last Exam and IMO Shortlist questions.