State-of-the-art language models (LMs) are notoriously susceptible to generating hallucinated information. Such inaccurate outputs not only undermine the reliability of these models but also limit their use and raise serious concerns about misinformation and propaganda. In this work, we focus on hallucinated book and article references and present them as the "model organism" of language model hallucination research, because they are frequent and easy to discern. We posit that if a language model cites a particular reference in its output, then it should ideally possess sufficient information about that reference's authors and content, among other relevant details. Using this basic insight, we illustrate that one can identify hallucinated references without ever consulting any external resources, by posing a set of direct or indirect queries to the language model about the references. These queries can be viewed as "consistency checks." Our findings highlight that while LMs, including GPT-4, often produce inconsistent author lists for hallucinated references, they also often accurately recall the authors of real references. In this sense, the LM can be said to "know" when it is hallucinating references. Furthermore, these findings show how hallucinated references can be dissected to shed light on their nature. Replication code and results are available at https://github.com/microsoft/hallucinated-references.
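To make the idea concrete, here is a minimal sketch of what such a consistency check might look like. It assumes a hypothetical `query_lm` helper that returns one sampled completion from the model under test; the prompt wording, sample count, similarity measure, and threshold are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of a reference "consistency check": sample author lists for a
# cited title several times and measure how much the answers agree.
# query_lm is a hypothetical stub, not a real library call.

from difflib import SequenceMatcher


def query_lm(prompt: str) -> str:
    """Hypothetical stub: call the language model under test with a
    nonzero sampling temperature and return its text completion."""
    raise NotImplementedError("plug in your LM client here")


def author_consistency(title: str, num_samples: int = 5) -> float:
    """Return the mean pairwise string similarity of sampled author lists.

    Low similarity suggests the reference is hallucinated; high similarity
    suggests the model consistently recalls the work's real authors.
    """
    prompt = f'Who are the authors of "{title}"? List only the names.'
    samples = [query_lm(prompt).strip().lower() for _ in range(num_samples)]
    scores = [
        SequenceMatcher(None, a, b).ratio()
        for i, a in enumerate(samples)
        for b in samples[i + 1:]
    ]
    return sum(scores) / len(scores) if scores else 1.0


# Illustrative usage: flag a reference when agreement falls below a
# threshold (0.5 here is an arbitrary assumed cutoff).
# if author_consistency("Some Cited Paper Title") < 0.5:
#     print("Likely hallucinated reference")
```

Note that this sketch checks only author-list consistency; the abstract's "direct or indirect queries" could equally target other details of the reference, such as its content or publication venue.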