Current large language models (LLMs) can exhibit near-human levels of performance on many natural language tasks, including open-domain question answering. Unfortunately, they also convincingly hallucinate incorrect answers, so that responses to questions must be verified against external sources before they can be accepted at face value. In this paper, we report a simple experiment to automatically verify generated answers against a corpus. After presenting a question to an LLM and receiving a generated answer, we query the corpus with the combination of the question + generated answer. We then present the LLM with the combination of the question + generated answer + retrieved answer, prompting it to indicate if the generated answer can be supported by the retrieved answer. We base our experiment on questions and passages from the MS MARCO (V1) test collection, exploring three retrieval approaches ranging from standard BM25 to a full question answering stack, including a reader based on the LLM. For a large fraction of questions, we find that an LLM is capable of verifying its generated answer if appropriate supporting material is provided. However, with an accuracy of 70-80%, this approach cannot be fully relied upon to detect hallucinations.
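The verification loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `llm` and `search` callables, the `Verification` record, and the prompt wording in `build_verification_prompt` are all hypothetical, since the abstract does not specify the actual prompt, model, or retrieval interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verification:
    question: str
    generated: str   # answer produced by the LLM
    retrieved: str   # answer or passage returned from the corpus
    supported: bool  # LLM's judgment: is `generated` supported by `retrieved`?


def build_verification_prompt(question: str, generated: str, retrieved: str) -> str:
    # Hypothetical wording; the real prompt is not given in the abstract.
    return (
        f"Question: {question}\n"
        f"Generated answer: {generated}\n"
        f"Retrieved answer: {retrieved}\n"
        "Is the generated answer supported by the retrieved answer? "
        "Reply with 'yes' or 'no'."
    )


def verify_answer(
    question: str,
    llm: Callable[[str], str],     # assumed: maps a prompt to a text completion
    search: Callable[[str], str],  # assumed: maps a query to the top retrieved answer
) -> Verification:
    # Step 1: ask the LLM the question.
    generated = llm(question)
    # Step 2: query the corpus with question + generated answer.
    retrieved = search(f"{question} {generated}")
    # Step 3: ask the LLM whether the retrieved answer supports the generated one.
    verdict = llm(build_verification_prompt(question, generated, retrieved))
    supported = verdict.strip().lower().startswith("yes")
    return Verification(question, generated, retrieved, supported)
```

Here `search` stands in for any of the three retrieval approaches (from BM25 up to a full question answering stack); passing it in as a callable keeps the verification step independent of the retriever.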