While test-time scaling has enabled large language models to solve highly difficult tasks, state-of-the-art results come at exorbitant compute cost. This inefficiency can be attributed to the miscalibration of post-trained language models and to the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws on conformal prediction and test-time training. Specifically, we introduce a meta-learning procedure that updates the calibration module for each input. This allows ORCA to provide valid confidence estimates under distribution shift, e.g., shifts in thought patterns across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risk, but also empirically achieves higher efficiency and better generalization across different reasoning tasks. At risk level $\delta=0.1$, ORCA improves the efficiency of Qwen2.5-32B on in-distribution tasks, with savings of up to 47.5% with supervised labels and 40.7% with self-consistency labels. In zero-shot out-of-domain settings, it raises MATH-500 savings from 24.8% under the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at https://github.com/wzekai99/ORCA.
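To make the role of the risk level $\delta$ concrete, the following is a minimal sketch of a generic static threshold calibration in the style of conformal risk control, i.e., the kind of fixed calibration that a per-input method would be compared against. It is not ORCA's meta-learned procedure and is not taken from the released code. The function name `calibrate_threshold`, the scalar confidence scores, and the 0/1 loss (accepting a wrong answer) are all illustrative assumptions.

```python
def calibrate_threshold(scores, correct, delta):
    """Return a stopping threshold whose adjusted calibration risk is <= delta.

    scores:  confidence score per calibration example (hypothetical input)
    correct: True if the answer available at that point is correct
    delta:   target risk level, e.g. 0.1
    """
    n = len(scores)
    # Candidate thresholds: each observed score, then +inf (never stop early).
    candidates = sorted(set(scores)) + [float("inf")]
    for lam in candidates:
        # Loss is 1 when a wrong answer would be accepted at threshold lam.
        errors = sum(1 for s, c in zip(scores, correct) if s >= lam and not c)
        # Finite-sample correction: add one worst-case loss for the test point.
        if (errors + 1) / (n + 1) <= delta:
            return lam
    # Too few calibration examples for this delta: fall back to never stopping.
    return float("inf")


def savings(scores, threshold):
    """Fraction of examples on which sampling would stop early."""
    return sum(1 for s in scores if s >= threshold) / len(scores)
```

Because the threshold is fixed after calibration, this sketch assumes that calibration and test examples are exchangeable. That assumption is exactly what breaks under the distribution shifts described above, which is the motivation for updating the calibration per input.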