Recent advances in large language models (LLMs) have considerably improved the capabilities of summarization systems. However, these systems still suffer from hallucinations. While prior work has evaluated LLMs extensively in the news domain, most evaluation of dialogue summarization has focused on BART-based models, leaving a gap in our understanding of LLM faithfulness in this setting. Our work benchmarks the faithfulness of LLMs for dialogue summarization using human annotations, focusing on identifying and categorizing span-level inconsistencies. Specifically, we evaluate two prominent LLMs: GPT-4 and Alpaca-13B. Our evaluation reveals subtleties as to what constitutes a hallucination: LLMs often generate plausible inferences, supported by circumstantial evidence in the conversation, that lack direct evidence, a pattern that is less prevalent in older models. We propose a refined taxonomy of errors, coining the category of "Circumstantial Inference" to bucket these LLM behaviors, and release our annotated dataset. Using our taxonomy, we compare the behavioral differences between LLMs and older fine-tuned models. Additionally, we systematically assess the efficacy of automatic error detection methods on LLM summaries and find that they struggle to detect these nuanced errors. To address this, we introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics, particularly for identifying "Circumstantial Inference."