Despite the success of large language models (LLMs) in natural language generation, substantial evidence shows that they can produce incorrect or nonsensical text. This limitation makes it important to discern when an LLM can be trusted, especially in safety-critical domains. Existing methods elicit verbalized confidence as a signal of reliability, either by inducing top-k responses or by sampling and aggregating multiple responses, but they often fail because the confidence lacks objective guidance. To address this, we propose the CONfidence-Quality-ORDer-preserving alignment approach (CONQORD), which leverages reinforcement learning with a tailored dual-component reward function comprising a quality reward and an order-preserving alignment reward. Specifically, the order-preserving reward incentivizes the model to verbalize greater confidence for higher-quality responses, so that the ordering of confidence matches the ordering of quality. Experiments demonstrate that CONQORD significantly improves the alignment between confidence levels and response accuracy without making the model over-cautious. Furthermore, the aligned confidence provided by CONQORD indicates when to trust LLMs and serves as a determinant for initiating the retrieval of external knowledge. Aligning confidence with response quality yields more transparent and reliable responses, and thus better trustworthiness.
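To make the dual-component reward concrete, the following is a minimal sketch of one way such a reward could be computed over a batch of sampled responses. It is an illustration under stated assumptions, not the formulation used by CONQORD: the function name `dual_component_rewards`, the weight `alpha`, and the pairwise product form of the alignment term are all hypothetical choices made here. The only property taken from the abstract is that the reward combines a quality term with a term that favors higher verbalized confidence on higher-quality responses.

```python
from typing import List


def dual_component_rewards(
    quality: List[float],
    confidence: List[float],
    alpha: float = 1.0,  # hypothetical weight on the alignment term
) -> List[float]:
    """Return one reward per response: quality plus an order-preserving term.

    Assumed alignment term for response i (not taken from the paper):
        mean over j != i of (c_i - c_j) * (q_i - q_j)
    It is positive when confidence and quality are ordered the same way
    relative to the other responses, and negative when they disagree.
    """
    if len(quality) != len(confidence):
        raise ValueError("quality and confidence must have the same length")

    n = len(quality)
    rewards = []
    for i in range(n):
        align = 0.0
        for j in range(n):
            if j != i:
                align += (confidence[i] - confidence[j]) * (quality[i] - quality[j])
        if n > 1:
            align /= n - 1  # average over the other responses
        rewards.append(quality[i] + alpha * align)
    return rewards


if __name__ == "__main__":
    q = [0.9, 0.2]  # response 0 is the better answer
    print(dual_component_rewards(q, [0.8, 0.3]))  # confidence ordered like quality
    print(dual_component_rewards(q, [0.3, 0.8]))  # confidence ordered against quality
```

In this sketch, a batch whose confidences follow the quality ordering receives a higher reward for every response than the same batch with the confidences swapped, which is the behavior the order-preserving component is meant to encourage.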