Large language models excel at reasoning but lack key aspects of introspection, including the ability to anticipate their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognitive decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each sample at any point in generation, and the absence of confidence signals can mislead users, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but they do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length -- no extra models, architecture changes, or inference overhead. This full joint distribution is used to compute a sampling utility: a linear combination of the expected maximum reward, total compute, and latency of the set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue sampling from or to initiate new samples from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
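As a rough sketch (not notation from this abstract), the sampling utility over a candidate set of samples $S$ might take the following form, where $R_s$ denotes the final reward of sample $s$, $C_s$ its remaining length, $\lambda_{\text{compute}}$ and $\lambda_{\text{latency}}$ are assumed trade-off weights, and expectations are taken under the predicted joint distribution over reward and remaining length; treating latency as the longest remaining sample under parallel generation is likewise an assumption:

\[
U(S) \;=\; \mathbb{E}\!\left[\max_{s \in S} R_s\right] \;-\; \lambda_{\text{compute}}\,\mathbb{E}\!\left[\sum_{s \in S} C_s\right] \;-\; \lambda_{\text{latency}}\,\mathbb{E}\!\left[\max_{s \in S} C_s\right]
\]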