Chain-of-thought (CoT) prompting is a common technique for improving the reasoning abilities of large language models (LLMs). However, extended reasoning is often unnecessary and substantially increases token usage. A key question, then, is how to allocate compute so that extended reasoning is invoked only when it is actually needed. We study this question through confidence-gated CoT, in which a model first produces a direct answer together with a confidence estimate and uses that estimate to decide whether to invoke CoT. We present an evaluation framework together with the first systematic study of confidence signals for this decision. We evaluate four representative confidence measures and compare them with random gating and an oracle upper bound. Experiments across two model families and diverse reasoning tasks show that existing training-free confidence measures can reduce redundant reasoning. However, we also find that the utility of individual confidence measures is inconsistent across settings. Through our evaluation framework and analysis, our study provides practical guidance for developing and evaluating models that use CoT selectively.
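To make the gating mechanism concrete, the sketch below shows one way confidence-gated CoT could be wired up. It is illustrative only: the function names, the callable interface, and the fixed threshold are assumptions for this example, not the paper's exact setup, and any of the four confidence measures studied could be plugged in as the confidence signal.

```python
from typing import Callable, Tuple

# Hypothetical sketch of confidence-gated CoT. The caller supplies two model-call
# functions: `direct` returns (answer, confidence in [0, 1]) from a single direct
# pass, and `cot` returns an answer after an extended chain-of-thought pass.
# The threshold value is illustrative, not taken from the paper.

def confidence_gated_cot(
    question: str,
    direct: Callable[[str], Tuple[str, float]],
    cot: Callable[[str], str],
    threshold: float = 0.8,
) -> Tuple[str, bool]:
    """Answer directly when confident; otherwise invoke CoT.

    Returns the final answer and a flag indicating whether CoT was used.
    """
    answer, confidence = direct(question)
    if confidence >= threshold:
        # Confident enough: keep the cheap direct answer and skip CoT.
        return answer, False
    # Low confidence: spend the extra tokens on chain-of-thought reasoning.
    return cot(question), True
```

In this framing, random gating corresponds to ignoring the confidence value and flipping a coin, while the oracle upper bound corresponds to invoking CoT exactly on the questions the direct answer would get wrong.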