Evaluating human-AI decision-making systems is an emerging challenge, as new ways of combining multiple AI models toward a specific goal are proposed every day. When humans interact with AI in decision-making systems, multiple factors may come into play in a task, including trust, interpretability, and explainability. In this context, this work proposes a retrospective method to support a more holistic understanding of how people interact with and connect multiple AI models and combine their outputs in human-AI decision-making systems. The method employs a retrospective end-user walkthrough to help HCI practitioners understand the higher-order cognitive processes involved and the role that AI model outputs play in human-AI decision-making. The method was qualitatively assessed with 29 participants (four in a pilot phase and 25 in the main user study) interacting with a human-AI decision-making system for financial decision-making. The system combines visual analytics, three AI models for revenue prediction, AI-supported analogues analysis, and hypothesis testing using external news and natural language processing to provide multiple means for comparing companies. Beyond results on tasks and usability problems, the outcomes suggest that the method is promising for highlighting why AI models are ignored, used, or trusted, and how future interactions are planned. We suggest that HCI practitioners researching human-AI interaction can benefit from adding this step to the debriefing stage of user studies, applying it much as a retrospective think-aloud protocol would be applied, but with emphasis on revisiting tasks and understanding why participants ignored or connected predictions while performing a task.