Recent research has generated hope that inference scaling, such as resampling solutions until they pass verifiers like unit tests, could allow weaker models to match stronger ones. Beyond inference, this approach also enables training reasoning models, where data is curated using rejection sampling against a verifier. However, we show that this approach is fundamentally limited when verifiers are imperfect and have a non-zero probability of producing false positives. Resampling cannot decrease this probability, so it imposes an upper bound on the accuracy of resampling-based inference scaling, regardless of compute budget. Our analysis shows a strong correlation between a model's single-sample accuracy and its false positive rate on HumanEval and MBPP, whose unit tests have limited coverage. Therefore, no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model. Empirical results show that the optimal number of sampling attempts is often fewer than 10, as the negative utility of false positives outweighs the benefits of further resampling, bending inference scaling curves downward. Finally, false positives may have other undesirable qualities, like poor adherence to coding style conventions.
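The upper bound described above can be illustrated with a minimal sketch. It assumes a simplified model that is not taken from the paper: each sample is independently either truly correct (probability `p_correct`), incorrect but accepted by the verifier (probability `p_false_positive`), or rejected, and the first accepted sample is returned. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
def resampling_accuracy(p_correct: float, p_false_positive: float, budget: int) -> float:
    """Probability that resample-until-pass returns a truly correct solution.

    Assumption (illustrative, not the paper's exact setup): samples are i.i.d.,
    and the first sample that passes the verifier is returned.
    """
    p_pass = p_correct + p_false_positive  # chance a single sample passes the verifier
    if p_pass == 0.0:
        return 0.0
    # Chance that at least one of `budget` samples passes the verifier.
    p_any_pass = 1.0 - (1.0 - p_pass) ** budget
    # Given that a sample passed, it is truly correct with probability p_correct / p_pass.
    return (p_correct / p_pass) * p_any_pass


def accuracy_ceiling(p_correct: float, p_false_positive: float) -> float:
    """Limit of resampling_accuracy as the budget grows without bound."""
    p_pass = p_correct + p_false_positive
    return p_correct / p_pass if p_pass > 0.0 else 0.0


if __name__ == "__main__":
    # Hypothetical weak model: 20% truly correct, 10% false positives per sample.
    for k in (1, 10, 100, 1000):
        print(k, round(resampling_accuracy(0.2, 0.1, k), 4))
    print("ceiling:", round(accuracy_ceiling(0.2, 0.1), 4))
```

Under these assumed numbers the accuracy approaches `p_correct / (p_correct + p_false_positive)`, about 0.667, and never exceeds it however large the budget, so a hypothetical stronger model with single-sample accuracy of 0.7 would remain out of reach. Because the ceiling depends only on the ratio of correct to accepted samples, more compute does not raise it; only a better verifier (a smaller `p_false_positive`) does.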