Large Language Models (LLMs) often generate erroneous outputs, known as hallucinations, because they struggle to recognize questions that lie beyond their knowledge scope. Although hallucination has been a focal point of research, previous efforts have concentrated primarily on improving correctness and have given little consideration to rejection mechanisms. In this paper, we comprehensively examine the role of rejection and introduce the notion of model reliability along with corresponding metrics. These metrics measure the model's ability to provide accurate responses while rejecting questions that exceed its knowledge boundary, thereby minimizing hallucinations. To improve the inherent reliability of LLMs, we present a novel alignment framework called Reinforcement Learning from Knowledge Feedback (RLKF). RLKF leverages knowledge feedback to dynamically determine the model's knowledge boundary and trains a reliable reward model to encourage the refusal of out-of-knowledge questions. Experimental results on mathematical questions show that RLKF substantially improves LLM reliability.