Large Language Models (LLMs) often exhibit misaligned confidence scores, typically overestimating the reliability of their predictions. While verbalized confidence in LLMs has gained attention, prior work remains divided on whether confidence scores can be systematically steered through prompting. Recent studies even argue that such prompt-induced confidence shifts are negligible, suggesting that LLM confidence calibration is rigid to linguistic interventions. Contrary to these claims, we first rigorously confirm the existence of directional confidence shifts by probing three models (GPT3.5, LLAMA3-70b, and GPT4) across 7 benchmarks, demonstrating that explicit instructions can inflate or deflate confidence scores in a regulated manner. Based on this observation, we propose SteeringConf, a novel framework with three components: confidence steering, steered confidence aggregation, and steered answer selection. SteeringConf leverages a confidence manipulation mechanism to steer the confidence scores of LLMs in several desired directions, followed by a summarization module that aggregates the steered confidence scores to produce a final prediction. We evaluate our method on 7 benchmarks, where it consistently outperforms the baselines on calibration metrics for both the confidence calibration and failure detection tasks.
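To make the steer-then-aggregate pipeline concrete, the sketch below shows one plausible instantiation of the aggregation and answer-selection steps, assuming that each steering prompt (e.g., a deflating, a neutral, and an inflating instruction) has already produced an (answer, verbalized confidence) pair from the LLM. The function name and the frequency-weighted mean aggregation rule are illustrative assumptions, not the paper's exact procedure.

```python
from collections import defaultdict
from statistics import mean

def aggregate_steered(outputs):
    """Aggregate (answer, confidence) pairs collected under different steering prompts.

    Confidence values are assumed to lie in [0, 1]. Aggregation here is a simple
    frequency-weighted mean per candidate answer, and selection picks the answer
    with the highest aggregated score -- one plausible rule, used only as a sketch.
    """
    per_answer = defaultdict(list)
    for answer, conf in outputs:
        per_answer[answer].append(conf)

    scored = {
        ans: (len(confs) / len(outputs)) * mean(confs)
        for ans, confs in per_answer.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]

# Example: three steering directions (deflate, neutral, inflate) on one question.
steered_outputs = [("Paris", 0.60), ("Paris", 0.80), ("Lyon", 0.95)]
answer, calibrated_conf = aggregate_steered(steered_outputs)
print(answer, round(calibrated_conf, 3))  # -> Paris 0.467
```

In this toy example, the answer that stays consistent across steering directions ("Paris") is selected, and its aggregated confidence is lower than the single inflated score, illustrating how aggregation over steered outputs can temper overconfidence.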