Deployed artificial intelligence (AI) often impacts humans, and there is no one-size-fits-all metric to evaluate these tools. Human-centered evaluation of AI-based systems combines quantitative and qualitative analysis and human input. It has been explored to some depth in the explainable AI (XAI) and human-computer interaction (HCI) communities. Gaps remain, but the community accepts the basic understanding that humans interact with AI and its accompanying explanations, and that humans' needs -- complete with their cognitive biases and quirks -- should be held front and center. In this paper, we draw parallels between the relatively mature field of XAI and the rapidly evolving research boom around large language models (LLMs). Accepted evaluative metrics for LLMs are not human-centered. We argue that many of the same paths trodden by the XAI community over the past decade will be retrodden when discussing LLMs. Specifically, we argue that humans' tendencies -- again, complete with their cognitive biases and quirks -- should rest front and center when evaluating deployed LLMs. We outline three developed focus areas of human-centered evaluation of XAI (mental models, use case utility, and cognitive engagement), and we highlight the importance of exploring each of these concepts for LLMs. Our goal is to jumpstart human-centered LLM evaluation.