Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation

Modern LLMs can now produce highly readable abstractive summaries, to the point where traditional automated metrics for evaluating summary quality, such as ROUGE, have become saturated. However, LLMs still sometimes introduce unwanted content into summaries, i.e., information inconsistent with or unsupported by their source. Automatically measuring the occurrence of these often subtle ``hallucinations'' has proved challenging. This, in turn, has motivated the development of a variety of metrics intended to measure the factual consistency of generated summaries against their source. But are these approaches measuring what they purport to measure? In this work, we stress-test automatic factuality metrics. Specifically, we investigate whether and to what degree superficial attributes of summary texts suffice to predict ``factuality'', finding that a (supervised) model using only such shallow features is reasonably competitive with state-of-the-art factuality scoring methods. We then evaluate how factuality metrics respond to factual corrections in inconsistent summaries and find that only a few show meaningful improvements. In contrast, some metrics are more sensitive to benign, non-factual edits. Motivated by these insights, we show that one can ``game'' (most) automatic factuality metrics, i.e., reliably inflate ``factuality'' scores by appending innocuous sentences to generated summaries. Taken together, our results raise questions about the degree to which we should rely on existing automated factuality metrics and what exactly we want ``factuality metrics'' to measure.
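To make the notion of ``superficial attributes'' concrete, the following is a minimal sketch of how shallow features of a summary might be computed relative to its source. The specific features (summary length and the fraction of summary n-grams absent from the source) and the function names are illustrative assumptions, not the feature set used in this work.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shallow_features(source, summary):
    """Compute hypothetical shallow features of a summary relative to its source.

    These features are assumptions chosen for illustration only:
    - summary_len: number of tokens in the summary
    - novel_{n}gram_frac: fraction of summary n-grams that never occur in the source
    """
    src_tokens = tokenize(source)
    sum_tokens = tokenize(summary)
    feats = {"summary_len": len(sum_tokens)}
    for n in (1, 2):
        src_set = set(ngrams(src_tokens, n))
        sum_ngrams = ngrams(sum_tokens, n)
        novel = sum(1 for g in sum_ngrams if g not in src_set)
        # Guard against summaries too short to contain any n-gram of this order.
        feats[f"novel_{n}gram_frac"] = novel / len(sum_ngrams) if sum_ngrams else 0.0
    return feats


if __name__ == "__main__":
    print(shallow_features("The cat sat on the mat.", "The dog sat."))
```

Feature vectors of this kind could then be passed to any off-the-shelf supervised classifier trained on human factuality labels; the sketch says nothing about how well such a model would perform.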
