Reasoning Large Language Models (LLMs) enable test-time scaling: dataset-level accuracy improves as the token budget grows, motivating adaptive reasoning, which spends tokens when they improve reliability and stops early when additional computation is unlikely to help. However, setting the token budget, as well as the thresholds for adaptive reasoning, is a practical challenge that entails a fundamental risk-accuracy trade-off. We reframe budget setting as a risk-control problem: limit the error rate while minimizing compute. Our framework introduces an upper threshold that stops reasoning once the model is sufficiently confident (at the risk of accepting an incorrect output) and a novel parametric lower threshold that preemptively halts reasoning on instances unlikely to be solved (at the risk of stopping prematurely). Given a target risk level and a validation set, we use distribution-free risk control to optimally specify these stopping mechanisms. For scenarios with multiple budget-controlling criteria, we incorporate an efficiency loss to select the most computationally efficient exit mechanism. Empirical results across diverse reasoning tasks and models confirm the effectiveness of our risk-control approach, showing computational-efficiency gains from the lower threshold and from ensembled stopping mechanisms while adhering to the user-specified risk target.
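To make the calibration concrete, the sketch below illustrates one way a distribution-free, Learn-then-Test-style procedure could set the stopping thresholds from a validation set; it is a minimal illustration under stated assumptions, not the paper's exact method. All names (`calibrate_upper_threshold`, `should_stop`, `confidences`, `correct`), the Hoeffding confidence bound, and the exponential form of the lower curve are illustrative assumptions introduced here.

```python
import math

import numpy as np


def calibrate_upper_threshold(confidences, correct, alpha=0.1, delta=0.05):
    """Select the smallest confidence threshold whose certified error rate
    stays below the target risk alpha.

    Sketch of Learn-then-Test-style calibration (an assumption, not the
    paper's exact procedure): scan candidate thresholds from conservative
    to aggressive and keep the last one whose Hoeffding upper confidence
    bound on the validation error rate is at most alpha.
    """
    candidates = np.sort(np.unique(confidences))[::-1]  # high -> low
    certified = None
    for lam in candidates:
        accepted = confidences >= lam          # instances that would stop here
        n = int(accepted.sum())
        if n == 0:
            continue
        emp_risk = 1.0 - float(np.mean(correct[accepted]))
        # Hoeffding upper confidence bound on the true error rate
        ucb = emp_risk + math.sqrt(math.log(1.0 / delta) / (2 * n))
        if ucb <= alpha:
            certified = lam                    # certified; try a lower threshold
        else:
            break                              # first failure: stop scanning
    return certified


def should_stop(conf_t, step, lam_upper, lower_params):
    """Two-sided stopping rule at inference time (illustrative only).

    The lower curve b(t) = a * (1 - exp(-t / tau)) is a hypothetical
    parametric form standing in for the paper's lower threshold; its
    parameters would be calibrated the same way as lam_upper.
    """
    a, tau = lower_params
    lower = a * (1.0 - math.exp(-step / tau))
    if conf_t >= lam_upper:
        return "answer"    # confident enough: emit the answer now
    if conf_t <= lower:
        return "give_up"   # likely unsolvable: stop spending tokens
    return "continue"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conf = rng.uniform(size=2000)
    corr = rng.uniform(size=2000) < conf   # toy data: higher confidence, more often correct
    print(calibrate_upper_threshold(conf, corr, alpha=0.1, delta=0.05))
```

Scanning thresholds in a fixed order and stopping at the first failure is a fixed-sequence test, so the family-wise error stays at delta without multiplicity corrections; the resulting guarantee is distribution-free up to the usual i.i.d. (or exchangeability) assumption on the validation set.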