Large language models (LLMs) using chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.
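As a rough illustration of the rule described above (abstain once the value function falls below the abstention reward), the following is a minimal sketch, not the paper's implementation. The names `generate_with_abstention`, `value_estimate`, and `toy_value` are hypothetical, and the toy value estimator is a hand-written stand-in for the approximation method the paper derives, which is not reproduced here.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def generate_with_abstention(
    tokens: Iterable[str],
    value_estimate: Callable[[List[str]], float],
    abstention_reward: float,
) -> Tuple[Optional[List[str]], int]:
    """Consume a token stream and abstain mid-generation if warranted.

    After each token, the (hypothetical) value estimate of the partial
    trace is compared with the abstention reward. If the estimate falls
    strictly below the reward, generation stops and no output is returned.

    Returns (trace, tokens_used); trace is None when the rule abstains.
    """
    prefix: List[str] = []
    for token in tokens:
        prefix.append(token)
        if value_estimate(prefix) < abstention_reward:
            # Abstain: withhold the output and stop spending compute.
            return None, len(prefix)
    return prefix, len(prefix)


def toy_value(prefix: List[str]) -> float:
    """Toy stand-in for a learned value function (an assumption):
    each "oops" token lowers the estimated chance of a correct answer."""
    return 1.0 - 0.3 * prefix.count("oops")


trace = ["a", "b", "oops", "c", "oops", "d"]

# Higher abstention reward: the trace is cut off at the second "oops".
abstained, used_high = generate_with_abstention(trace, toy_value, 0.5)

# Lower abstention reward: the same trace is allowed to finish.
completed, used_low = generate_with_abstention(trace, toy_value, 0.3)
```

In this toy setting, raising the abstention reward makes the rule stop earlier and spend fewer tokens, while lowering it lets more traces run to completion, which mirrors the trade-off the parameter is said to control.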