Open-domain question answering systems frequently rely on information retrieved from large collections of text (such as the Web) to answer questions. However, such collections often contain conflicting information, and depending on it indiscriminately can produce untruthful and inaccurate answers. To understand the severity of this problem, we collect a human-annotated dataset, Question Answering with Conflicting Contexts (QACC), and find that as many as 25% of unambiguous, open-domain questions lead to conflicting contexts when retrieved using Google Search. We evaluate and benchmark three powerful Large Language Models (LLMs) on QACC and demonstrate their limitations in effectively addressing questions with conflicting information. To explore how humans reason through conflicting contexts, we ask our annotators to provide explanations for their selections of correct answers. We show that by finetuning LLMs to explain their answers, we can introduce richer information into their training that guides them through the process of reasoning with conflicting contexts.