Large language models (LMs) are prone to generating diverse factually incorrect statements, widely referred to as hallucinations. Current approaches predominantly focus on coarse-grained automatic hallucination detection or editing, overlooking nuanced error levels. In this paper, we propose a novel task, automatic fine-grained hallucination detection, and present a comprehensive taxonomy encompassing six hierarchically defined types of hallucination. To facilitate evaluation, we introduce a new benchmark that includes fine-grained human judgments on the outputs of two LMs across various domains. Our analysis reveals that ChatGPT and Llama 2-Chat exhibit hallucinations in 60% and 75% of their outputs, respectively, and that a majority of these hallucinations fall into categories that have been underexplored. As an initial step to address this, we train FAVA, a retrieval-augmented LM, on carefully designed synthetic data to detect and correct fine-grained hallucinations. On our benchmark, our automatic and human evaluations show that FAVA outperforms ChatGPT on fine-grained hallucination detection by a large margin, though substantial room for improvement remains. FAVA's suggested edits also improve the factuality of LM-generated text, resulting in 5-10% FActScore improvements.