Long-form generations from large language models (LLMs) contain a mix of factual and non-factual claims, which makes evaluating their factuality difficult. To evaluate the factual precision of long-form generations at a finer granularity, prior work proposes decomposing a long-form generation into multiple verifiable facts and verifying those facts independently. The factual precision of the generation is then the proportion of verifiable facts among all the facts. Such methods assume that combining factual claims yields a factual paragraph. This paper shows that this assumption can be violated due to entity ambiguity: LLMs can generate paragraphs in which every fact is individually verifiable, yet the facts are combined into a non-factual paragraph because they describe distinct entities (e.g., different people who share the same name). We further reveal that existing factual precision metrics, including FActScore and citation recall, cannot properly evaluate the factuality of these non-factual paragraphs. To address this, we introduce an enhanced metric, D-FActScore, specifically designed for content with ambiguous entities. We evaluate the D-FActScores of biographies of people generated with retrieval-augmented generation (RAG) and show that D-FActScore assesses the factuality of paragraphs with entity ambiguity better than FActScore does. We also find that four widely used open-source LLMs tend to mix information about distinct entities to form non-factual paragraphs.
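For concreteness, the decompose-then-verify factual precision described above can be written as a simple ratio. The sketch below uses our own notation (\mathcal{A}_y and \mathcal{C} are illustrative symbols, not taken from the paper):

% Factual precision of a generation y under the decompose-then-verify recipe.
% A_y: set of atomic facts decomposed from y; C: knowledge source for verification.
% Notation is ours, introduced only for illustration.
\[
  \mathrm{FActScore}(y)
  \;=\;
  \frac{1}{|\mathcal{A}_y|}
  \sum_{a \in \mathcal{A}_y}
  \mathbb{1}\!\left[\, a \text{ is supported by } \mathcal{C} \,\right]
\]

Note that this ratio is blind to entity ambiguity: a paragraph that interleaves supported facts about two distinct namesakes can still score 1, which is precisely the failure mode that motivates D-FActScore.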