Large language models (LLMs) can achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet they often generate unnecessarily long reasoning paths that incur high inference cost. Self-consistency-based approaches push accuracy higher still, but they require sampling and aggregating multiple reasoning trajectories, which adds substantial computational overhead. In this paper, we introduce a confidence-aware selective sampling framework that, at inference time, analyzes a single reasoning trajectory to decide adaptively whether to rely on that trajectory alone or to trigger multi-path sampling. The framework guides this decision with trajectory-level numeric features and sentence-level linguistic features extracted from reasoning states. We train it on MedQA and evaluate it in-domain on MedQA and, under calibration-only transfer without further fine-tuning, on MathQA, MedMCQA, and MMLU. Experimental results show that the proposed framework performs comparably to full and efficient multi-path reasoning baselines, with accuracy changes of $-0.41 \pm 0.58$ and $-0.31 \pm 0.58$ percentage points, respectively, while reducing token usage by $71.7 \pm 5.0\%$ and $36.6 \pm 9.1\%$. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
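The selective sampling decision described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function `selective_answer`, the `sample_path` callable, the fixed `threshold`, and the number of extra paths `n_extra` are all hypothetical names chosen for illustration, and the scalar confidence stands in for whatever score the framework derives from its trajectory-level and sentence-level features. Plain majority voting is assumed as the aggregation step, in the spirit of self-consistency.

```python
from collections import Counter
from typing import Callable, Tuple


def selective_answer(
    sample_path: Callable[[], Tuple[str, float]],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
    n_extra: int = 4,        # hypothetical number of additional paths
) -> Tuple[str, int]:
    """Return (answer, number_of_paths_sampled).

    `sample_path` is an assumed callable that runs one reasoning
    trajectory and returns its final answer together with a confidence
    score in [0, 1] estimated from that trajectory.
    """
    first_answer, confidence = sample_path()

    # Confident enough: rely on the single trajectory and stop early.
    if confidence >= threshold:
        return first_answer, 1

    # Otherwise trigger multi-path sampling and aggregate by majority vote.
    answers = [first_answer]
    for _ in range(n_extra):
        answer, _unused_confidence = sample_path()
        answers.append(answer)

    winner, _votes = Counter(answers).most_common(1)[0]
    return winner, len(answers)
```

In this sketch the token savings come entirely from the early return: every question whose first trajectory clears the threshold costs one path instead of `1 + n_extra`.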