Retrieval-augmented language models (RALMs) represent a substantial advancement in the capabilities of large language models, notably in reducing factual hallucination by leveraging external knowledge sources. However, the reliability of the retrieved information is not always guaranteed. The retrieval of irrelevant data can lead to misguided responses, potentially causing the model to overlook its inherent knowledge even when it possesses adequate information to address the query. Moreover, standard RALMs often struggle to assess whether they possess adequate knowledge, both intrinsic and retrieved, to provide an accurate answer. When knowledge is lacking and the answer is unattainable, these systems should ideally respond with "unknown". In response to these challenges, we introduce Chain-of-Noting (CoN), a novel approach aimed at improving the robustness of RALMs when facing noisy, irrelevant documents and when handling unknown scenarios. The core idea of CoN is to generate sequential reading notes for the retrieved documents, enabling a thorough evaluation of their relevance to the given question, and to integrate this information when formulating the final answer. We employed ChatGPT to create training data for CoN, which was subsequently used to train a LLaMa-2 7B model. Our experiments across four open-domain QA benchmarks show that RALMs equipped with CoN significantly outperform standard RALMs. Notably, CoN achieves an average improvement of +7.9 in exact match (EM) score given entirely noisy retrieved documents and +10.5 in rejection rate for real-time questions that fall outside the pre-training knowledge scope.
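The note-then-answer flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract describes training a model on CoN data, whereas this sketch only shows an inference-time prompt that asks for one reading note per retrieved passage before the final answer. The prompt wording, the `Final answer:` marker, and the names `build_con_prompt`, `parse_final_answer`, and `answer_with_notes` are all assumptions made for illustration, and `generate` stands in for any text-generation callable.

```python
from typing import Callable, List


def build_con_prompt(question: str, documents: List[str]) -> str:
    """Assemble a prompt requesting one reading note per passage, then an answer.

    The wording is hypothetical; it is not taken from the paper.
    """
    lines = [
        "Read each retrieved passage and write a short note on whether it "
        "helps answer the question.",
        "Then give the final answer on a line starting with 'Final answer:'.",
        "If neither the passages nor your own knowledge suffice, "
        "answer 'unknown'.",
        "",
        f"Question: {question}",
    ]
    # Number the passages so each note can refer to its passage.
    for i, doc in enumerate(documents, start=1):
        lines.append(f"Passage {i}: {doc}")
    return "\n".join(lines)


def parse_final_answer(output: str, marker: str = "Final answer:") -> str:
    """Return the first line after the last marker, or 'unknown' if absent."""
    _, found, tail = output.rpartition(marker)
    if not found or not tail.strip():
        return "unknown"
    return tail.strip().splitlines()[0].strip()


def answer_with_notes(
    question: str,
    documents: List[str],
    generate: Callable[[str], str],
) -> str:
    """Run the note-then-answer flow with a caller-supplied generator."""
    prompt = build_con_prompt(question, documents)
    return parse_final_answer(generate(prompt))


if __name__ == "__main__":
    # Stub generator used only so the sketch runs without a model.
    def stub_generate(prompt: str) -> str:
        return (
            "Note 1: The passage discusses an unrelated topic.\n"
            "Final answer: unknown"
        )

    print(answer_with_notes("Who won the match?", ["A passage."], stub_generate))
```

Falling back to "unknown" when no answer line is found mirrors the rejection behaviour the abstract asks for, although a trained CoN model would produce that response itself rather than rely on a parsing default.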