Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on the limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) and commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved pieces of evidence and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
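To make the idea of search-augmented few-shot prompting more concrete, the following is a minimal sketch of how retrieved evidence might be assembled into a prompt. It is not the released FreshPrompt implementation. The `Evidence` schema, the `build_prompt` function, its parameters, the prompt wording, and the choice to keep the most recent items and list them oldest-first are all assumptions made for illustration; the abstract states only that the number and order of retrieved evidence matter and that asking for concise, direct answers helps reduce hallucination.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class Evidence:
    """One retrieved search result (hypothetical schema, not from the paper)."""
    source: str
    published: date
    snippet: str


def build_prompt(
    question: str,
    evidences: Sequence[Evidence],
    demonstrations: Sequence[str] = (),
    max_evidences: int = 5,
    today: Optional[date] = None,
) -> str:
    """Assemble a few-shot prompt that includes retrieved evidence."""
    today = today or date.today()
    limit = max(0, max_evidences)
    # Assumption: keep only the `limit` most recent pieces of evidence ...
    kept = sorted(evidences, key=lambda e: e.published, reverse=True)[:limit]
    # ... and list them oldest first, so the freshest one sits nearest the question.
    kept.sort(key=lambda e: e.published)

    # Few-shot demonstrations (already formatted strings) come first.
    parts = list(demonstrations)
    for ev in kept:
        parts.append(
            f"source: {ev.source}\n"
            f"date: {ev.published.isoformat()}\n"
            f"snippet: {ev.snippet}"
        )
    parts.append(f"question: {question}")
    # Illustrative instruction wording; the exact phrasing is an assumption.
    parts.append(f"As of {today.isoformat()}, answer concisely and directly.")
    parts.append("answer:")
    return "\n\n".join(parts)


# Usage with made-up search results (placeholder sources and snippets).
results = [
    Evidence("example.org/a", date(2021, 5, 1), "Older report."),
    Evidence("example.org/b", date(2023, 9, 1), "Newest report."),
    Evidence("example.org/c", date(2022, 3, 1), "Middle report."),
]
prompt = build_prompt(
    "Who holds the record?", results, max_evidences=2, today=date(2023, 10, 1)
)
print(prompt)
```

Exposing `max_evidences` and the sort order as explicit steps makes it easy to vary the two factors the analysis identifies as influential, the number of evidence items and their order, without touching the rest of the prompt.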