From Features to Actions: Explainability in Traditional and Agentic AI Systems

Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than by a single output. Although post-hoc explanations are useful for static predictions, it remains unclear how they translate to agentic settings, where behaviour emerges over time. In this work, we bridge the gap between static and agentic explainability by empirically comparing attribution-based explanations used in static classification tasks with trace-based diagnostics used in agentic benchmarks (TAU-bench Airline and AssistantBench). Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman $\rho = 0.86$), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is 2.7$\times$ more prevalent in failed runs and reduces success probability by 49\%. These findings motivate a shift towards trajectory-level explainability when evaluating and diagnosing autonomous AI behaviour in agentic systems. Resources: https://github.com/VectorInstitute/unified-xai-evaluation-framework and https://vectorinstitute.github.io/unified-xai-evaluation-framework
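
The two kinds of summary statistic quoted in the abstract, a Spearman rank correlation between feature rankings and a prevalence ratio between failed and successful runs, can be illustrated with a minimal sketch. The function names (`spearman_rho`, `prevalence_ratio`) and all input values below are hypothetical and chosen only for illustration; they are not the paper's data, and the abstract does not specify the exact estimation procedure the authors use.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks (largest value gets rank 1); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks of a and b."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def prevalence_ratio(failed_flags: Sequence[int], success_flags: Sequence[int]) -> float:
    """Rate of a flagged behaviour in failed runs divided by its rate in successful runs."""
    rate_failed = sum(failed_flags) / len(failed_flags)
    rate_success = sum(success_flags) / len(success_flags)
    return rate_failed / rate_success


# Hypothetical attribution scores for four features from two explanation runs.
run_1 = [0.50, 0.30, 0.10, 0.05]
run_2 = [0.30, 0.50, 0.10, 0.05]  # the top two features swap places
stability = spearman_rho(run_1, run_2)

# Hypothetical per-run flags (1 = rubric marked a state tracking inconsistency).
ratio = prevalence_ratio(failed_flags=[1, 1, 1, 0], success_flags=[1, 0, 0, 0])
```

A value of `stability` close to 1 means the two runs order the features almost identically, which is the sense in which a ranking is stable. A `ratio` above 1 means the flagged behaviour appears more often in failed runs than in successful ones.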

