People use large language models (LLMs) when they should not. This is partly because they see LLMs compose poems and answer intricate questions, so they understandably, but incorrectly, assume LLMs will not stumble on basic tasks like simple arithmetic. Prior work has tried to address this by clustering instance embeddings into regions where an LLM is likely to fail and automatically describing the patterns in those regions. The discovered failure patterns are then taught to users to mitigate their overreliance. Yet this approach has not fully succeeded. In this analysis paper, we aim to understand why. We first examine whether the negative result stems from an absence of failure patterns. We group instances in two datasets by their meta-labels and evaluate an LLM's predictions on these groups. We then define criteria to flag groups that are sizable and on which the LLM is error-prone, and we find meta-label groups that meet these criteria. Their meta-labels are failure patterns of the LLM that could be taught to users, so such patterns do exist. We next test whether prompting- and embedding-based approaches can surface these known failures; without reliable discovery, users cannot be taught about them to reduce their overreliance. We find mixed results across methods, which could explain the earlier negative result. Finally, we revisit the metric used to measure teaching effectiveness. We propose assessing a user's ability to use the given failure patterns to anticipate when an LLM is error-prone. A user study shows a positive effect of teaching under this metric, unlike human-AI team accuracy. Our findings show that teaching failure patterns could be a viable approach to mitigating overreliance, but its success depends on better automated failure-discovery methods and on metrics like ours.
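As a minimal, hypothetical sketch of the group-flagging step summarized above, the snippet below groups instances by meta-label and flags groups that are both sizable and error-prone. The field names (`meta_label`, `llm_correct`) and thresholds are illustrative assumptions, not the paper's actual criteria.

```python
from collections import defaultdict

def flag_failure_groups(instances, min_size=30, min_error_rate=0.5):
    """Group instances by meta-label and flag groups that are sizable
    and on which the LLM is error-prone (thresholds are illustrative)."""
    groups = defaultdict(list)
    for inst in instances:
        # each instance carries a meta-label and whether the LLM answered it correctly
        groups[inst["meta_label"]].append(inst["llm_correct"])

    flagged = []
    for label, outcomes in groups.items():
        size = len(outcomes)
        error_rate = 1 - sum(outcomes) / size
        if size >= min_size and error_rate >= min_error_rate:
            flagged.append((label, size, error_rate))
    return flagged

# Usage: each flagged meta-label is a candidate failure pattern to teach users, e.g.
# instances = [{"meta_label": "multi-step arithmetic", "llm_correct": False}, ...]
# print(flag_failure_groups(instances))
```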