Despite numerous mitigation attempts since the inception of language models, hallucinations remain a persistent problem even in today's frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition that subsumes them. We argue that hallucination can be defined simply as inaccurate (internal) world modeling that manifests in a form observable to the user, for example, stating a fact that contradicts a knowledge base or producing a summary that contradicts its source. By varying the reference world model and the conflict policy, our framework recovers the prior definitions. We argue that this unified view is useful because it forces evaluations to make their assumed reference "world" explicit, distinguishes true hallucinations from planning or reward errors, and provides a common language for comparing benchmarks and discussing mitigation strategies. Building on this definition, we also connect our framework to HalluWorld, a complementary benchmark that instantiates fully specified reference world models for stress-testing model hallucinations.