We propose a general two-stage algorithm that enjoys a provable scaling law for the test-time compute of large language models (LLMs). Given an input problem, the proposed algorithm first generates $N$ candidate solutions and then selects the best one via a multi-round knockout tournament, where each pair of candidates is compared $K$ times and only the winners advance to the next round. In a minimalistic implementation, both stages can be executed with a black-box LLM alone and nothing else (e.g., no external verifier or reward model), and a total of $N \times (K + 1)$ highly parallelizable LLM calls suffice for solving an input problem. Assuming that each generated candidate solution is correct with probability $p_{\text{gen}} > 0$, and that a comparison between a correct and an incorrect solution identifies the right winner with probability $p_{\text{comp}} > 0.5$ (i.e., better than a random guess), we prove theoretically that the failure probability of the proposed algorithm decays to zero exponentially with respect to $N$ and $K$: $$\mathbb{P}(\text{final output is incorrect}) \le (1 - p_{\text{gen}})^N + \lceil \log_2 N \rceil e^{-2 K (p_{\text{comp}} - 0.5)^2}.$$ Our empirical results on the challenging MMLU-Pro benchmark validate the technical assumptions, as well as the efficacy of the proposed algorithm and the gains from scaling up its test-time compute.
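The two-stage procedure can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` and `compare` stand in for the two kinds of LLM calls, and the toy "oracle" mocks at the bottom are hypothetical, used only to exercise the control flow.

```python
import random

def knockout_best_of_n(generate, compare, n, k, problem):
    """Stage 1: draw n candidate solutions (n LLM calls).
    Stage 2: knockout tournament; each pairing is decided by a
    majority vote over k pairwise comparisons (at most n*k calls,
    highly parallelizable)."""
    candidates = [generate(problem) for _ in range(n)]
    while len(candidates) > 1:
        winners = []
        for i in range(0, len(candidates) - 1, 2):
            a, b = candidates[i], candidates[i + 1]
            # k comparisons per pair; the majority winner advances
            votes_for_a = sum(compare(problem, a, b) for _ in range(k))
            winners.append(a if votes_for_a > k / 2 else b)
        if len(candidates) % 2 == 1:
            winners.append(candidates[-1])  # odd one out gets a bye
        candidates = winners
    return candidates[0]

# Toy usage with stand-in functions (illustrative, not real LLM calls):
random.seed(0)
gen = lambda q: "correct" if random.random() < 0.5 else "wrong"
# oracle comparator: returns True iff the first candidate should win
cmp_fn = lambda q, a, b: (a == "correct") or (a == b)
result = knockout_best_of_n(gen, cmp_fn, n=8, k=3, problem="q")
```

With an oracle comparator, a single correct candidate among the $N$ generations survives every round, which mirrors the role of the $(1 - p_{\text{gen}})^N$ term in the bound: the algorithm fails at stage 1 only if no correct candidate is generated at all.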