Single-document news summarization has seen substantial progress on faithfulness in recent years, driven by research on evaluating factual consistency, that is, detecting hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark for topic-focused dialogue summarization, with summaries generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries, along with detailed explanations for factually inconsistent sentences. Our analysis shows that existing LLMs produce a substantial number of factual errors in the dialogue domain, regardless of model size. Moreover, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by state-of-the-art specialized factuality evaluation metrics. Finally, we analyze hallucination types using a curated error taxonomy. We find that model-generated summaries exhibit diverse errors and error distributions, and that non-LLM-based metrics capture all error types better than LLM-based evaluators.