As reasoning modules such as chain-of-thought are applied to large language models (LLMs), these models achieve strong performance on tasks such as commonsense question answering and mathematical problem solving. A key remaining challenge is estimating the uncertainty of their answers, which helps protect users from misleading or severely hallucinated outputs. Current methods analyze long reasoning sequences by filtering out unrelated tokens and examining potential connections between nearby tokens or sentences, but they often overlook how confidence propagates over time. This oversight can inflate the overall confidence estimate even when earlier steps exhibit very low confidence. To address this issue, we propose a novel method that incorporates inter-step attention to model semantic correlations across reasoning steps. For long-horizon responses, we introduce a hidden confidence mechanism that retains historical confidence information and combines it with stepwise confidence to produce a more accurate overall estimate. We evaluate our method on the GAOKAO math benchmark and the CLadder causal-reasoning dataset using mainstream open-source LLMs. Our approach outperforms state-of-the-art methods, achieving a superior balance between predictive quality and calibration as measured by Negative Log-Likelihood (NLL) and Expected Calibration Error (ECE).
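The aggregation idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the function names (`softmax`, `overall_confidence`), the use of plain softmax as a stand-in for the paper's inter-step attention, the recurrent exponential-decay update as a stand-in for the hidden confidence mechanism, and the convex combination of the two views. The point it demonstrates is the one the abstract makes: a low-confidence early step continues to depress the overall estimate instead of being washed out by later high-confidence steps.

```python
import math


def softmax(scores):
    """Numerically stable softmax over step-to-step similarity scores;
    a stand-in for the paper's inter-step attention (the scores
    themselves are assumed to be given)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def overall_confidence(step_conf, step_sim, decay=0.7, mix=0.5):
    """Hypothetical aggregator combining two views of a reasoning trace:
    (1) attention-pooled stepwise confidence, and
    (2) a hidden confidence state that retains historical confidence.

    `decay` controls how strongly early steps persist in the hidden
    state; `mix` balances the two views in the final estimate."""
    # (1) inter-step attention pooling of stepwise confidences
    weights = softmax(step_sim)
    attended = sum(w * c for w, c in zip(weights, step_conf))

    # (2) hidden confidence: recurrent blend that carries history forward,
    # so an early low-confidence step is never fully forgotten
    hidden = step_conf[0]
    for c in step_conf[1:]:
        hidden = decay * hidden + (1 - decay) * c

    # combine both views into one overall estimate
    return mix * attended + (1 - mix) * hidden
```

With uniform similarity scores and stepwise confidences `[0.2, 0.9, 0.9]`, the estimate stays well below the final step's 0.9 because the hidden state still carries the weak first step; a naive "use the last step" heuristic would miss this.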