Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FActScore (Factual precision in Atomicity Score), a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FActScores of biographies of people generated by several state-of-the-art commercial LMs (InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI) and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT achieves only 58%). Since human evaluation is costly, we also introduce an automated model that estimates FActScore using retrieval and a strong language model, with an error rate below 2%. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs, which would have cost $26K if evaluated by humans. Among the findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models.
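To make the definition concrete, the following is a minimal sketch of the score for a single generation, assuming the atomic facts have already been extracted and that a support judgment is available for each one. The function name `factscore`, the `is_supported` callback, and the toy `KNOWLEDGE` set are illustrative assumptions, not the released implementation; in practice the judgment would come from human annotators or from the retrieval-based estimator.

```python
from typing import Callable, Iterable


def factscore(atomic_facts: Iterable[str], is_supported: Callable[[str], bool]) -> float:
    """Return the percentage of atomic facts judged supported (0 to 100)."""
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("need at least one atomic fact")
    supported = sum(1 for fact in facts if is_supported(fact))
    return 100.0 * supported / len(facts)


# Toy stand-in for a knowledge source: statements treated as supported.
# (Hypothetical data, for illustration only.)
KNOWLEDGE = {
    "Ada Lovelace was born in 1815.",
    "Ada Lovelace was a mathematician.",
}

# Atomic facts extracted from a hypothetical generated biography.
facts = [
    "Ada Lovelace was born in 1815.",
    "Ada Lovelace was a mathematician.",
    "Ada Lovelace was born in Paris.",
    "Ada Lovelace won a Nobel Prize.",
]

score = factscore(facts, lambda fact: fact in KNOWLEDGE)
print(score)  # 50.0
```

Here two of the four atomic facts are supported, so the generation scores 50%, a result that a single binary "factual or not" label could not express.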