Recent advances in the development of large language models (LLMs) are rapidly changing how online applications function. LLM-based search tools, for instance, offer a natural language interface that can accommodate complex queries and provide detailed, direct responses. At the same time, there are concerns about the veracity of the information these tools provide, because algorithmically generated text can contain mistakes or fabrications. In a set of online experiments, we investigate how LLM-based search changes people's behavior relative to traditional search, and what can be done to mitigate overreliance on LLM-based output. Participants in our experiments were asked to solve a series of decision tasks that involved researching and comparing different products, and were randomly assigned to do so with either an LLM-based search tool or a traditional search engine. In our first experiment, we find that participants using the LLM-based tool completed their tasks more quickly, using fewer but more complex queries than those who used traditional search. Moreover, these participants reported a more satisfying experience with the LLM-based search tool. When the information presented by the LLM was reliable, participants using the tool made decisions with a level of accuracy comparable to that of participants using traditional search; however, we observed overreliance on incorrect information when the LLM erred. Our second experiment further investigated this issue by randomly assigning some users to see a simple color-coded highlighting scheme that alerted them to potentially incorrect or misleading information in the LLM responses. Overall, we find that this confidence-based highlighting substantially increases the rate at which users spot incorrect information, improving the accuracy of their overall decisions while leaving most other measures unaffected.