While we increasingly rely on large language models (LLMs) for various tasks, these models are known to produce inaccurate content, or 'hallucinations', with potentially disastrous consequences. The recent integration of web search results into LLMs prompts the question of whether people utilize them to verify the generated content, thereby avoiding falling victim to hallucinations. This study (N = 560) investigated how the provision of search results, either static (fixed search results) or dynamic (participant-driven searches), affects participants' perceived accuracy of and confidence in evaluating LLM-generated content (i.e., genuine, minor hallucination, major hallucination), compared to a control condition (no search results). Findings indicate that participants in both the static and dynamic conditions (vs. control) rated hallucinated content as less accurate. However, those in the dynamic condition rated genuine content as more accurate and demonstrated greater overall confidence in their assessments than those in the static or control conditions. In addition, participants higher in need for cognition (NFC) rated major hallucinations as less accurate than did low-NFC participants, with no corresponding difference for genuine content or minor hallucinations. These results underscore the potential benefits of integrating web search results into LLMs for the detection of hallucinations, as well as the need for a more nuanced approach to developing human-centered systems that takes user characteristics into account.