Single-document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark for topic-focused dialogue summarization, with summaries generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries, along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant numbers of factual errors in the dialogue domain, regardless of model size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conduct an analysis of hallucination types using a curated error taxonomy. We find that model-generated summaries exhibit diverse errors and error distributions, and that non-LLM-based metrics capture all error types better than LLM-based evaluators.