Long-form generations from large language models (LLMs) contain a mix of factual and non-factual claims, which makes their factuality difficult to evaluate. Prior work evaluates the factuality of a long paragraph by decomposing it into multiple facts, verifying those facts independently, and aggregating the results. Such methods assume that combining factual claims yields a factual paragraph. This assumption can be violated: we show that strong open-source models like Llama-chat can generate paragraphs in which each fact is individually verifiable, yet the facts are combined into a non-factual paragraph because of entity ambiguity. We further show that existing factuality metrics, including FActScore and citation recall, cannot properly evaluate these non-factual paragraphs and overestimate their factuality. To address this, we introduce an enhanced metric, D-FActScore, specifically designed for content with ambiguous entities. We evaluate the D-FActScore of biographies of people generated by retrieval-augmented LLMs. We show that D-FActScore assesses the factuality of paragraphs with entity ambiguity better than FActScore does. We also find that four widely used open-source LLMs tend to mix information about distinct entities to form non-factual paragraphs, making their D-FActScore more than 10% lower than their FActScore.
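To make the decompose-verify-aggregate assumption concrete, the following toy sketch contrasts a per-fact score with a simple entity-aware variant. It is only an illustration under stated assumptions: the facts, the entity labels, and both scoring functions (`naive_score`, `entity_aware_score`) are hypothetical, and the entity-aware rule used here (credit only the facts supported by the single best-matching entity) is a stand-in, not the definition of FActScore or D-FActScore.

```python
from collections import Counter

# Hypothetical toy data: atomic facts decomposed from one generated biography.
# Each fact is paired with the set of same-named entities whose reference text
# supports it (an empty set means no entity supports the fact).
facts = [
    ("was born in 1950", {"Person A (politician)"}),
    ("served as a senator", {"Person A (politician)"}),
    ("won an Olympic medal", {"Person A (athlete)"}),
    ("founded a software company", set()),
]


def naive_score(facts):
    """Fraction of facts supported by ANY entity that shares the name."""
    if not facts:
        return 0.0
    supported = sum(1 for _, supporters in facts if supporters)
    return supported / len(facts)


def entity_aware_score(facts):
    """Fraction of facts supported by the single best-matching entity.

    Assumption for illustration only: the paragraph is treated as describing
    one person, so facts about other same-named entities earn no credit.
    """
    if not facts:
        return 0.0
    counts = Counter(e for _, supporters in facts for e in supporters)
    if not counts:
        return 0.0
    return max(counts.values()) / len(facts)


print(naive_score(facts))         # per-fact verification, entity-agnostic
print(entity_aware_score(facts))  # same facts, scored against one entity
```

In this toy setting three of the four facts are individually verifiable, but only two of them describe the same person, so the entity-agnostic score is higher than the entity-aware one. This is the kind of overestimation the abstract attributes to entity ambiguity.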