Large Language Models (LLMs) often generate erroneous outputs, known as hallucinations, because they struggle to recognize questions that lie beyond their knowledge scope. While mitigating hallucination has been a focal point of research, previous efforts concentrate primarily on improving correctness without giving due consideration to the role of rejection mechanisms. In this paper, we conduct a comprehensive examination of the role of rejection, introducing the notion of model reliability together with corresponding metrics. These metrics measure a model's ability to provide accurate responses while appropriately rejecting questions that exceed its knowledge boundary, thereby minimizing hallucinations. To improve the inherent reliability of LLMs, we present a novel alignment framework called Reinforcement Learning from Knowledge Feedback (RLKF). RLKF leverages knowledge feedback to dynamically determine the model's knowledge boundary and trains a reliable reward model that encourages the refusal of out-of-knowledge questions. Experimental results on mathematical questions confirm that RLKF substantially improves LLM reliability.
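To make the reliability notion concrete, the following is a minimal sketch of how a metric in this spirit could be computed, assuming a simple definition that counts a question as handled reliably when it is either answered correctly in scope or refused out of scope. The `Outcome` fields and the `reliability` function below are illustrative assumptions, not the paper's actual metric definitions.

```python
# Illustrative sketch of a reliability-style metric. This is an assumed
# definition for exposition, not the paper's exact formulation.
from dataclasses import dataclass


@dataclass
class Outcome:
    answerable: bool  # True if the question lies within the model's knowledge
    answered: bool    # True if the model attempted an answer (did not refuse)
    correct: bool     # True if an attempted answer was correct


def reliability(outcomes: list[Outcome]) -> float:
    """Fraction of questions handled reliably: answered correctly when
    answerable, or refused when unanswerable (hypothetical definition)."""
    reliable = sum(
        (o.answerable and o.answered and o.correct)  # correct answer in scope
        or (not o.answerable and not o.answered)     # proper refusal out of scope
        for o in outcomes
    )
    return reliable / len(outcomes) if outcomes else 0.0


# Example: 2 correct answers, 1 hallucination, 1 proper refusal -> 0.75
sample = [
    Outcome(answerable=True, answered=True, correct=True),
    Outcome(answerable=True, answered=True, correct=True),
    Outcome(answerable=False, answered=True, correct=False),   # hallucination
    Outcome(answerable=False, answered=False, correct=False),  # correct refusal
]
print(reliability(sample))  # 0.75
```

Under such a definition, a model that always answers is penalized for hallucinating on out-of-scope questions, while a model that always refuses is penalized for missing answerable ones, which is the trade-off the rejection-aware metrics are meant to capture.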