Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognitive decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but they do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, which equips models with zero-overhead introspective predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length, with no extra models, architecture changes, or inference overhead. This full joint distribution is used to compute a sampling utility: a linear combination of the expected maximum reward, total compute, and latency of a set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and it traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
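The sketch below illustrates, under stated assumptions, how a sampling utility of this form could be computed from per-sample joint distributions over final reward and remaining length. It is not the paper's implementation: the discretized reward/length grids, the independence assumption across samples, the use of the longest expected remaining length as a latency proxy, and the weights `w_reward`, `w_compute`, and `w_latency` are all illustrative choices.

```python
import numpy as np

def sampling_utility(joint_probs, reward_vals, len_vals,
                     w_reward=1.0, w_compute=1e-4, w_latency=1e-3):
    """Linear combination of expected max reward, total compute, and latency.

    joint_probs: array of shape (n_samples, n_rewards, n_lengths), where
        joint_probs[i, r, l] is sample i's predicted probability of ending
        with reward reward_vals[r] after len_vals[l] more tokens.
    """
    joint = np.asarray(joint_probs)

    # Marginals over final reward and remaining length for each sample.
    p_reward = joint.sum(axis=2)          # (n_samples, n_rewards)
    p_length = joint.sum(axis=1)          # (n_samples, n_lengths)

    # Expected maximum reward over the set, assuming samples are independent:
    # P(max_i R_i <= r) = prod_i P(R_i <= r).
    cdf_reward = np.cumsum(p_reward, axis=1)           # per-sample CDF
    cdf_max = np.prod(cdf_reward, axis=0)               # CDF of the maximum
    pmf_max = np.diff(np.concatenate(([0.0], cdf_max)))
    exp_max_reward = float(np.dot(pmf_max, reward_vals))

    # Expected total compute: sum of expected remaining lengths (tokens).
    exp_lengths = p_length @ len_vals                    # (n_samples,)
    exp_total_compute = float(exp_lengths.sum())

    # Expected latency: approximated by the longest expected remaining
    # length, assuming samples are generated in parallel.
    exp_latency = float(exp_lengths.max())

    return (w_reward * exp_max_reward
            - w_compute * exp_total_compute
            - w_latency * exp_latency)
```

Under this sketch, the meta-actions described in the abstract would amount to evaluating this utility for each candidate set of prefixes (continuing existing samples or initiating new ones) and choosing the set with the highest value.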