From Features to Actions: Explainability in Traditional and Agentic AI Systems

Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. Although existing explanation approaches are useful, it remains unclear how methods designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge the gap between static and agentic explainability by empirically comparing attribution-based explanations used in static classification tasks with trace-based diagnostics used in agentic benchmarks (TAU-bench Airline and AssistantBench). Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman $\rho = 0.86$), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is $2.7\times$ more prevalent in failed runs than in successful ones and reduces success probability by 49\%. These findings motivate a shift towards trajectory-level explainability for agentic systems when evaluating and diagnosing autonomous AI behaviour. Resources: https://github.com/VectorInstitute/unified-xai-evaluation-framework https://vectorinstitute.github.io/unified-xai-evaluation-framework
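To make the two headline quantities concrete, the following minimal sketch shows how a Spearman rank correlation between two attribution vectors and a prevalence ratio between failed and successful runs could be computed. It is an illustration only, not the authors' evaluation code: the function names (`average_ranks`, `spearman_rho`, `prevalence_ratio`) are hypothetical, and the numbers are invented toy data rather than results from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def prevalence_ratio(failed_flags, passed_flags):
    """Ratio of how often a rubric flag fires in failed vs. successful runs."""
    failed_rate = sum(failed_flags) / len(failed_flags)
    passed_rate = sum(passed_flags) / len(passed_flags)
    return failed_rate / passed_rate


# Toy data (assumed): importance scores for five features from two
# hypothetical attribution methods applied to the same static classifier.
attr_a = [0.40, 0.25, 0.15, 0.12, 0.08]
attr_b = [0.35, 0.30, 0.10, 0.17, 0.08]
rho = spearman_rho(attr_a, attr_b)

# Toy data (assumed): 1 means the "state tracking inconsistency" rubric
# flag fired on that agent trajectory, 0 means it did not.
failed_runs = [1, 1, 0, 1]
passed_runs = [1, 0, 0, 0]
ratio = prevalence_ratio(failed_runs, passed_runs)

print(f"Spearman rho = {rho:.2f}")       # 0.90 on this toy data
print(f"Prevalence ratio = {ratio:.1f}")  # 3.0 on this toy data
```

A high rank correlation indicates that two attribution runs agree on which features matter most, which is the sense of "stable feature rankings" above. The prevalence ratio compares flag frequencies across trajectory outcomes, which is a trajectory-level statistic rather than a per-input attribution.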
