The development of trustworthy conversational information-seeking systems relies on dialogue models that can generate faithful and accurate responses grounded in relevant knowledge texts. However, two main challenges hinder this task. First, language models may hallucinate because of data biases in their pretraining corpora. Second, knowledge texts often contain redundant and irrelevant information that distracts the model from the relevant text span. Previous work uses additional data annotations on the knowledge texts to train a knowledge identification module that bypasses irrelevant information, but collecting such high-quality span annotations can be costly. In this work, we use reinforcement learning with a novel reward function to address both challenges. Our reward function combines an accuracy metric and a faithfulness metric to provide a balanced quality judgment of generated responses, and it can serve as a cost-effective approximation to a human preference reward model when only a few preference annotations are available. Empirical experiments on two conversational information-seeking datasets demonstrate that our method is competitive with strong supervised learning baselines.
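
To make the idea of a combined reward concrete, the following is a minimal sketch rather than the reward used in this work. The abstract does not specify which accuracy and faithfulness metrics are combined or how they are weighted, so everything here is an assumption for illustration: token-level F1 against a reference response stands in for accuracy, the fraction of response tokens that appear in the knowledge text stands in for faithfulness, and the names `token_f1`, `knowledge_precision`, `combined_reward`, and the weight `alpha` are hypothetical.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenization (a simplification for illustration)."""
    return text.lower().split()


def token_f1(response: str, reference: str) -> float:
    """Assumed accuracy proxy: token-level F1 between response and reference."""
    resp, ref = tokenize(response), tokenize(reference)
    if not resp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(resp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def knowledge_precision(response: str, knowledge: str) -> float:
    """Assumed faithfulness proxy: share of response tokens found in the knowledge text."""
    resp = tokenize(response)
    if not resp:
        return 0.0
    knowledge_vocab = set(tokenize(knowledge))
    return sum(tok in knowledge_vocab for tok in resp) / len(resp)


def combined_reward(response: str, reference: str, knowledge: str,
                    alpha: float = 0.5) -> float:
    """Weighted sum of the two proxies; alpha is a hypothetical mixing weight."""
    accuracy = token_f1(response, reference)
    faithfulness = knowledge_precision(response, knowledge)
    return alpha * accuracy + (1 - alpha) * faithfulness
```

In a reinforcement learning setup, a scalar of this kind would be computed for each sampled response and passed to the policy-gradient update. A response that copies the knowledge text but misses the reference scores high on faithfulness only, and a fluent but unsupported response scores low on faithfulness, which is the balance the combined reward is meant to capture.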