Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of closed-source and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on the limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) and commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved pieces of evidence and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
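As a rough illustration of how retrieved evidence might be folded into a few-shot prompt, the following Python sketch assembles a prompt from demonstrations, dated evidence snippets, and an answer-style instruction. It is not the FreshPrompt implementation: the `Evidence` fields, the `build_prompt` function, the prompt layout, and the choice to keep the most recent items and order them oldest to newest are all assumptions made for illustration. The abstract states only that the number and order of evidence matter, not which setting works best.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence


@dataclass
class Evidence:
    """One retrieved search result (hypothetical fields, for illustration only)."""
    source: str
    published: date
    snippet: str


def build_prompt(
    question: str,
    evidences: Sequence[Evidence],
    demonstrations: Sequence[str] = (),
    max_evidences: int = 5,
    concise: bool = True,
) -> str:
    """Assemble a search-augmented few-shot prompt (illustrative sketch)."""
    # Assumption: keep the most recent `max_evidences` items ...
    newest_first = sorted(evidences, key=lambda e: e.published, reverse=True)
    kept = newest_first[:max(max_evidences, 0)]
    # ... and present them oldest to newest, so the freshest evidence
    # sits closest to the final answer slot.
    ordered = list(reversed(kept))

    parts = []
    # Few-shot demonstrations come first, each as a pre-formatted string.
    for demo in demonstrations:
        parts.append(demo.strip())
        parts.append("")
    parts.append(f"question: {question}")
    for ev in ordered:
        parts.append(f"source: {ev.source}")
        parts.append(f"date: {ev.published.isoformat()}")
        parts.append(f"snippet: {ev.snippet}")
    # Answer-style instruction; the exact wording here is hypothetical.
    if concise:
        parts.append("Answer concisely and directly.")
    else:
        parts.append("Answer in detail.")
    parts.append("answer:")
    return "\n".join(parts)
```

Calling `build_prompt(question, evidences, max_evidences=k)` with different values of `k`, or with a different sort key, gives a simple way to probe the sensitivity to evidence count and order that the abstract reports.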