Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query difficulty. To address this inefficiency, we formulate test-time compute allocation as a novel bandit learning problem and propose adaptive algorithms that estimate query difficulty on the fly and allocate compute accordingly. Compared to uniform allocation, our algorithms allocate more compute to challenging queries while maintaining accuracy on easier ones. Among challenging queries, our algorithms further learn to prioritize solvable instances, effectively reducing excessive computation on unsolvable queries. We theoretically prove that our algorithms achieve better compute efficiency than uniform allocation, and we empirically validate their effectiveness on math and code benchmarks. Specifically, our algorithms improve performance by up to 11.10% (15.04% relative) on MATH-500, up to 10.82% (14.44% relative) on AIME25, and up to 11.23% (15.29% relative) on LiveCodeBench.
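To make the bandit formulation concrete, the following is a minimal sketch of one way an adaptive allocator of this kind could look, cast here as Thompson sampling over per-query solve probabilities. This is an illustrative assumption rather than the paper's actual algorithm, and `attempt` is a hypothetical stand-in for drawing one LLM sample on a query and verifying its answer.

```python
import random

def allocate_compute(queries, budget, attempt):
    """Sketch of bandit-style test-time compute allocation (assumed design).

    Each query is treated as a Bernoulli arm whose unknown mean is the
    probability that one more LLM sample solves it. A Beta(1, 1) prior is
    updated after every observed failure; each unit of budget goes to the
    query whose posterior draw looks most promising, so compute drains
    away from queries that keep failing (likely unsolvable) while easy
    queries exit the pool as soon as they are solved.
    """
    alpha = {q: 1.0 for q in queries}  # Beta posterior pseudo-successes
    beta = {q: 1.0 for q in queries}   # Beta posterior pseudo-failures
    solved = set()

    for _ in range(budget):
        open_qs = [q for q in queries if q not in solved]
        if not open_qs:
            break
        # Thompson step: sample a plausible solve probability per open
        # query and spend this unit of compute on the most promising one.
        q = max(open_qs, key=lambda q: random.betavariate(alpha[q], beta[q]))
        if attempt(q):            # hypothetical: one LLM sample + answer check
            solved.add(q)         # stop spending on queries already solved
        else:
            beta[q] += 1.0        # failure: query now looks harder/unsolvable
    return solved
```

In this sketch, repeated failures shrink a query's posterior solve probability, so the remaining budget is steered toward challenging-but-solvable queries, mirroring the prioritization behavior described in the abstract.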