LLMs utilizing chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.
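The abstention rule described above, stopping once the value function falls below the abstention reward, can be illustrated with a minimal sketch. It assumes per-token value estimates are already available as a list of floats; the function name `first_abstention_step`, the parameter `abstain_reward`, and the numbers in `trace_values` are hypothetical and chosen only for illustration. The sketch does not reproduce the paper's method for approximating the value function.

```python
from typing import Optional, Sequence


def first_abstention_step(
    values: Sequence[float], abstain_reward: float
) -> Optional[int]:
    """Return the first token position whose estimated value is strictly
    below the abstention reward, or None if no position triggers abstention."""
    for t, v in enumerate(values):
        if v < abstain_reward:
            return t
    return None


# Hypothetical per-token value estimates for a single reasoning trace
# (illustrative numbers, not results from the paper).
trace_values = [0.62, 0.55, 0.41, 0.18, 0.07]

# With an assumed abstention reward of 0.25, generation would be cut
# at the first position where the estimate drops below that threshold.
stop_at = first_abstention_step(trace_values, abstain_reward=0.25)
print(stop_at)
```

Under this reading, a larger `abstain_reward` makes the rule cut traces earlier and saves more compute, while a smaller one lets more traces run to completion.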