When answering questions, LLMs can convey not only an answer, but also a level of confidence about the answer being correct. This includes explicit confidence markers (e.g., giving a numeric score) as well as implicit markers, such as an authoritative tone or elaborating with additional knowledge. For LLMs to be trustworthy knowledge sources, the confidence they convey should match their actual expertise; however, most current models tend towards overconfidence. To calibrate both implicit and explicit confidence markers, we introduce LACIE, a pragmatic, listener-aware finetuning method that models the listener, considering not only whether an answer is right but also whether it will be accepted by the listener. We cast calibration as preference optimization, creating data via a two-agent game in which a speaker model's outputs are judged by a simulated listener. We then finetune three LLMs (Mistral-7B, Llama3-8B, and Llama3-70B) with LACIE and show that the resulting models are better calibrated with respect to a simulated listener. Crucially, these trends transfer to human listeners, helping them correctly predict model correctness: in a human evaluation where annotators accept or reject an LLM's answers, training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers. Furthermore, LACIE generalizes to another dataset, yielding a large increase in truthfulness on TruthfulQA when trained on TriviaQA. Our analysis indicates that LACIE leads to better confidence separation between correct and incorrect examples. Qualitatively, we find that a LACIE-trained model hedges more and implicitly signals certainty when it is correct, e.g., by using an authoritative tone or including details. Finally, LACIE finetuning leads to an emergent increase in model abstention (e.g., saying "I don't know") for answers that are likely to be wrong.
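The two-agent data-generation step described above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual code: the class and function names (`Answer`, `simulated_listener_accepts`, `build_preference_pairs`) and the toy threshold-based listener are assumptions. The core idea it shows is that a speaker answer is "chosen" over another when the simulated listener's accept/reject decision better matches the answer's ground-truth correctness.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str          # full speaker output, including hedges or elaboration
    final_answer: str  # extracted answer for correctness checking
    confidence: float  # confidence conveyed to the listener, in [0, 1]

def simulated_listener_accepts(answer: Answer) -> bool:
    # Toy stand-in for the simulated listener: accept confident answers.
    # (In LACIE the listener is itself an LLM judging the speaker's output.)
    return answer.confidence >= 0.5

def listener_utility(is_correct: bool, is_accepted: bool) -> int:
    # Reward alignment between acceptance and correctness:
    # accepting a correct answer or rejecting an incorrect one scores 1.
    return int(is_correct == is_accepted)

def build_preference_pairs(answers: list[Answer], gold: str):
    """Score each speaker answer via the simulated listener, then emit
    (chosen, rejected) pairs for DPO-style preference optimization."""
    scored = []
    for ans in answers:
        correct = (ans.final_answer == gold)
        accepted = simulated_listener_accepts(ans)
        scored.append((ans, listener_utility(correct, accepted)))
    # A pair is formed whenever one answer strictly outscores another.
    return [(a, b) for a, sa in scored for b, sb in scored if sa > sb]
```

Under this scoring rule, a confidently stated wrong answer (accepted by the listener despite being incorrect) ends up on the "rejected" side of every pair, which is what pushes the finetuned speaker toward hedging or abstaining when it is likely wrong.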