As Large Language Models (LLMs) are nondeterministic, the same input can generate different outputs, some of which may be incorrect or hallucinated. If run again, the LLM may correct itself and produce the right answer. Unfortunately, most LLM-powered systems present a single result, which users tend to accept whether or not it is correct. Having the LLM produce multiple outputs may help surface disagreements or alternatives. However, it is not obvious how users will interpret such conflicts or inconsistencies. To this end, we investigate how users perceive the AI model and comprehend the generated information when they receive multiple, potentially inconsistent, outputs. Through a preliminary study, we identified five types of output inconsistencies. Based on these categories, we conducted a study (N=252) in which participants were given one or more LLM-generated passages in response to an information-seeking question. We found that inconsistency across multiple LLM-generated outputs lowered participants' perceived capability of the AI while increasing their comprehension of the given information. Specifically, this positive effect of inconsistencies was stronger for participants who read two passages than for those who read three. Based on these findings, we present design implications: rather than regarding LLM output inconsistencies as a drawback, systems can reveal potential inconsistencies to transparently indicate the limitations of these models and promote critical LLM usage.