Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of biographies of people generated by several state-of-the-art commercial LMs (InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI) and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT achieves only 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs, which would have cost $26K if evaluated by humans. Among the findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via `pip install factscore`.
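The score described in the abstract, the percentage of atomic facts supported by a knowledge source, can be illustrated with a minimal sketch. This is not the API of the `factscore` package: the function `fact_score`, the `is_supported` callback, and the toy `KNOWLEDGE` set are hypothetical names introduced only for illustration, and the example assumes the generation has already been split into atomic facts.

```python
from typing import Callable, Sequence


def fact_score(atomic_facts: Sequence[str],
               is_supported: Callable[[str], bool]) -> float:
    """Return the fraction of atomic facts judged as supported.

    `is_supported` stands in for whatever verifies a fact against the
    knowledge source (a human annotator or an automated estimator).
    """
    if not atomic_facts:
        raise ValueError("need at least one atomic fact")
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)


# Toy knowledge source (hypothetical): a fact counts as supported
# only if it appears verbatim in this set.
KNOWLEDGE = {
    "Ada Lovelace was born in 1815.",
    "Ada Lovelace was a mathematician.",
}

# Atomic facts extracted from one imagined generation.
facts = [
    "Ada Lovelace was born in 1815.",
    "Ada Lovelace was a mathematician.",
    "Ada Lovelace was born in Paris.",
    "Ada Lovelace won a Nobel Prize.",
]

score = fact_score(facts, lambda fact: fact in KNOWLEDGE)
print(f"{score:.0%}")  # prints "50%"
```

Here two of the four atomic facts are supported, so the generation scores 50%. A binary judgment would have to label the whole passage either correct or incorrect, and would lose that distinction.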