Knowledge-intensive tasks, such as open-domain question answering (QA), require a substantial amount of factual knowledge and often rely on external information for assistance. Recently, large language models (LLMs) such as ChatGPT have demonstrated impressive prowess in solving a wide range of tasks with world knowledge, including knowledge-intensive tasks. However, it remains unclear how well LLMs perceive their factual knowledge boundaries, particularly how they behave when incorporating retrieval augmentation. In this study, we present an initial analysis of the factual knowledge boundaries of LLMs and of how retrieval augmentation affects LLMs on open-domain QA. Specifically, we focus on three primary research questions and analyze them by examining the QA performance, a priori judgement, and a posteriori judgement of LLMs. We show evidence that LLMs possess unwavering confidence in their ability to answer questions and in the accuracy of their responses. Furthermore, retrieval augmentation proves to be an effective approach to enhancing LLMs' awareness of their knowledge boundaries, thereby improving their judgemental abilities. We also find that LLMs have a propensity to rely on the provided retrieval results when formulating answers, and that the quality of these results significantly impacts this reliance. The code to reproduce this work is available at https://github.com/RUCAIBox/LLM-Knowledge-Boundary.