To enable human oversight, agentic AI systems often provide a trace of reasoning and action steps. Designing traces with an informative, but not overwhelming, level of detail remains a critical challenge. In three user studies on a Computer User Agent, we investigate the utility of basic action traces for verification, explore three alternatives via design probes, and test a novel interface's impact on error finding in question-answering tasks. As expected, we find that current practices are cumbersome, limiting their efficacy. Conversely, our proposed design reduced the time participants spent finding errors. However, although participants reported higher confidence in their decisions, their final accuracy was not meaningfully improved. Taken together, our studies surface challenges for human verification of agentic systems, including managing built-in assumptions, users' subjective and shifting correctness criteria, and the shortcomings, yet importance, of communicating the agent's process.