Current state-of-the-art language models (LMs) are notorious for generating text with "hallucinations," a primary example being book and paper references that lack any solid basis in their training data. However, we find that many of these fabrications can be identified using the same LM, using only black-box queries without consulting any external resources. Consistency checks done with direct queries about whether the generated reference title is real (inspired by Kadavath et al. 2022, Lin et al. 2022, Manakul et al. 2023) are compared to consistency checks with indirect queries which ask for ancillary details such as the authors of the work. These consistency checks are found to be partially reliable indicators of whether or not the reference is a hallucination. In particular, we find that LMs in the GPT-series will hallucinate differing authors of hallucinated references when queried in independent sessions, while it will consistently identify authors of real references. This suggests that the hallucination may be more a result of generation techniques than the underlying representation.
翻译:当前最先进的语言模型(LMs)因生成包含“幻觉”的文本而闻名,一个典型例子是生成在训练数据中缺乏可靠依据的书籍和论文参考文献。然而,我们发现,仅通过黑盒查询、不借助任何外部资源,即可利用同一语言模型识别出许多此类虚构内容。我们比较了两种一致性检验方法:直接查询生成的参考文献标题是否真实(受Kadavath等人2022年、Lin等人2022年、Manakul等人2023年启发),以及间接查询要求提供作品的作者等辅助细节。研究发现,这些一致性检验在判断参考文献是否为幻觉方面具有部分可靠性。特别地,我们观察到GPT系列语言模型在独立会话中查询时,会为虚构的参考文献生成不同的作者,而对真实参考文献的作者则能保持一致识别。这表明幻觉现象可能更多源于生成技术,而非底层表征。