Beyond the Black Box: Interpretability of Agentic AI Tool Use

AI agents are promising for high-stakes enterprise workflows, but dependable deployment remains limited because tool-use failures are difficult to diagnose and control. Agents may skip required tool calls, invoke tools unnecessarily, or take actions whose consequences become visible only after execution. Existing observability methods are external: prompts reveal correlations, evaluations score outputs, and logs arrive only after the model has already acted. In long-horizon settings, these failures are costly because an early tool mistake can alter the rest of the trajectory, increase token consumption, and create downstream safety and security risks. We introduce a mechanistic-interpretability toolkit built on Sparse Autoencoders (SAEs), which decompose activations into sparse internal features, and linear probes, lightweight classifiers that read signals from those features. The framework reads model states before each action and infers whether a tool is needed and how risky the next tool action is. It identifies the model layers and features most associated with tool decisions and tests their functional importance through feature ablation. We train the probes on multi-step trajectories from the NVIDIA Nemotron function-calling dataset and apply the same workflow to the GPT-OSS 20B and Gemma 3 27B models. The goal is not to replace external evaluation, but to add a missing layer: visibility into what the model signaled internally before it acted. This helps surface deeper causes of agent failure, especially in long-horizon runs where an early mistake can affect subsequent agent behavior. More broadly, the paper shows how mechanistic interpretability can support internal observability for monitoring tool calls and risk in agent systems.
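
To make the pipeline described above concrete, the following is a minimal sketch of the three steps (encode activations into SAE-style features, fit a linear probe for "tool needed", ablate the most important feature) on synthetic data. Everything in it is an assumption made for illustration: the stand-in encoder is a random orthonormal matrix rather than a trained, overcomplete SAE; the activations are synthetic rather than drawn from GPT-OSS 20B or Gemma 3 27B; the names `sae_encode`, `train_probe`, `ablate`, and `SIGNAL_FEATURE` are hypothetical; and ablation is measured only through the probe's accuracy, which may differ from the procedure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 32         # hypothetical activation width
SIGNAL_FEATURE = 7   # hypothetical feature that encodes "tool needed"

# Stand-in SAE encoder (assumption): orthonormal directions plus a negative
# bias so most features are inactive. A real SAE is trained and overcomplete.
W_enc, _ = np.linalg.qr(rng.normal(size=(D_MODEL, D_MODEL)))
b_enc = -1.0


def sae_encode(acts):
    """ReLU encoder: map activations to non-negative, mostly-zero features."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)


def make_data(n):
    """Synthetic pre-action activations; label 1 means a tool call is needed."""
    y = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, D_MODEL))
    # Plant the label along one encoder direction so one feature carries it.
    acts += 4.0 * y[:, None] * W_enc[:, SIGNAL_FEATURE]
    return acts, y


def train_probe(feats, y, lr=0.5, steps=300):
    """Linear probe: logistic regression fitted by plain gradient descent."""
    w, b = np.zeros(feats.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        grad = p - y
        w -= lr * feats.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b


def accuracy(feats, y, w, b):
    """Fraction of samples where the probe's sign matches the label."""
    return float(np.mean((feats @ w + b > 0) == (y == 1)))


def ablate(feats, idx):
    """Feature ablation: zero one feature and leave the others unchanged."""
    out = feats.copy()
    out[:, idx] = 0.0
    return out


train_acts, train_y = make_data(400)
test_acts, test_y = make_data(200)
train_f, test_f = sae_encode(train_acts), sae_encode(test_acts)

w, b = train_probe(train_f, train_y)
top_feature = int(np.argmax(np.abs(w)))  # feature the probe relies on most

acc_full = accuracy(test_f, test_y, w, b)
acc_ablated = accuracy(ablate(test_f, top_feature), test_y, w, b)
print(f"top feature: {top_feature}")
print(f"probe accuracy: {acc_full:.2f}, after ablation: {acc_ablated:.2f}")
```

In this toy setup, a large drop in probe accuracy after zeroing a single feature is the kind of evidence that feature ablation is meant to provide: the feature is not merely correlated with the tool decision but is what the readout depends on. In a real deployment the probe would be read before each action, so the prediction is available before the tool call executes.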
