Plausible but inaccurate tokens in model-generated text are widely believed to be pervasive and problematic for the responsible adoption of language models. Despite this concern, there is little scientific work that attempts to measure the prevalence of language model hallucination comprehensively. In this paper, we argue that language models should be evaluated using repeatable, open, and domain-contextualized hallucination benchmarking. We present a taxonomy of hallucinations alongside a case study demonstrating that when experts are absent from the early stages of data creation, the resulting hallucination metrics lack validity and practical utility.