Tool Calling is Linearly Readable and Steerable in Language Models

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As agents take on more consequential actions, a single bad tool call can do real damage. So far there has been no way to look inside the model and catch the mistake before it happens; this paper provides one. Inside the model, the choice between two tools is carried by a single direction in activation space, one direction per pair of tools. Adding that direction during generation switches which tool the model picks. We evaluate 12 instruction-tuned and 6 base models spanning Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B parameters). On instruction-tuned models with 4B or more parameters, the intervention switches the tool with 83-100% accuracy on a 15-tool synthetic benchmark and 77-94% on the real-API benchmark $\tau$-bench airline. The JSON arguments that follow adapt automatically to the new tool's schema, so flipping the tool name is enough. The same per-tool directions also flag likely errors before they happen: on Gemma 3 27B, queries where the model is unsure between two tools fail 21x more often than queries where it is not. The effect is not just topic injection. Random vectors of the same magnitude give a 0% switch rate, and a probe within a single domain (14 airline tools that share one topic) still reads which tool the model will call, at 61-89% top-1 accuracy across five 4B-14B models. Even base models carry the right tool internally before they can emit it: a cosine readout of the chosen tool from the model's internal state reaches 61-82% accuracy on BFCL, while base-model generation reaches only 2-10%. This suggests that pretraining forms the representation and instruction tuning later wires it to the output. Our results cover single-turn, fixed-menu settings; in multi-turn agent loops the same intervention is less stable, with gains or losses of up to 30 percentage points relative to a matched baseline and no consistent direction.
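To make the pairwise-direction idea concrete, here is a minimal sketch on synthetic data. It assumes that each tool's representation can be summarized by the mean activation over queries that call it, and that the direction for a pair of tools is the difference of those means; the abstract does not state how the directions are estimated, so this is an illustrative choice rather than the paper's procedure. The function names (`tool_centroids`, `cosine_readout`, `steer`) and the Gaussian stand-in activations are hypothetical.

```python
import numpy as np

def tool_centroids(acts, labels, n_tools):
    """Mean activation per tool. acts: (n, d) array, labels: (n,) int array."""
    return np.stack([acts[labels == t].mean(axis=0) for t in range(n_tools)])

def cosine_readout(h, centroids):
    """Return (predicted tool, top-1 minus top-2 cosine margin) for activation h.

    Centroids and h are centered on the mean centroid first, so the
    readout reflects what distinguishes tools rather than what they share.
    """
    center = centroids.mean(axis=0)
    c = centroids - center
    x = h - center
    sims = (c @ x) / (np.linalg.norm(c, axis=1) * np.linalg.norm(x) + 1e-8)
    order = np.argsort(sims)[::-1]
    return int(order[0]), float(sims[order[0]] - sims[order[1]])

def steer(h, centroids, src, dst, alpha=1.0):
    """Add the src -> dst pair direction (assumed: difference of means) to h."""
    return h + alpha * (centroids[dst] - centroids[src])

# Synthetic stand-in for residual-stream activations (not real model data).
rng = np.random.default_rng(0)
d, n_tools, per_tool = 64, 3, 50
true_means = rng.normal(size=(n_tools, d)) * 3.0
labels = np.repeat(np.arange(n_tools), per_tool)
acts = true_means[labels] + rng.normal(size=(len(labels), d))
centroids = tool_centroids(acts, labels, n_tools)

# A fresh "query" whose activation sits near tool 0.
h = true_means[0] + rng.normal(size=d)
pred_before, margin_before = cosine_readout(h, centroids)

# Steering: push the activation along the tool-0 -> tool-2 direction.
h_steered = steer(h, centroids, src=0, dst=2)
pred_after, _ = cosine_readout(h_steered, centroids)

# Control: a random vector with the same norm as the steering direction.
direction = centroids[2] - centroids[0]
rand = rng.normal(size=d)
rand *= np.linalg.norm(direction) / np.linalg.norm(rand)
pred_random, _ = cosine_readout(h + rand, centroids)

print(pred_before, pred_after, pred_random, round(margin_before, 3))
```

In a real model the activations would come from a chosen layer at the token position just before the tool name, and the steered vector would be written back during generation. The margin returned by `cosine_readout` is one possible proxy for the "unsure between two tools" signal: a small margin means two tools are nearly tied.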
