Hallucination in Natural Language Generation (NLG) has long been the elephant in the room: obvious yet often overlooked, until recent advances substantially improved the fluency and grammatical accuracy of generated text and brought the problem to the fore. For Large Language Models (LLMs), hallucinations can arise across a wide range of downstream tasks and casual conversations, and they demand accurate assessment to ensure reliability and safety. However, existing studies on hallucination evaluation differ widely, and it remains difficult to organize them and select the most appropriate evaluation methods. Moreover, as NLP research shifts toward the domain of LLMs, this direction faces new challenges. This paper provides a comprehensive survey of the evolution of hallucination evaluation methods, aiming to address three key aspects: 1) the diverse definitions and granularity of facts; 2) the categories of automatic evaluators and their applicability; 3) unresolved issues and future directions.