AI agents are autonomous systems that combine LLMs with external tools to solve complex tasks. While such tools extend capability, improper tool permissions introduce security risks such as indirect prompt injection and tool misuse. We characterize these failures as unbalanced tool-driven agency: agents may retain unnecessary permissions (excessive agency) or fail to invoke required tools (insufficient agency), enlarging the attack surface and degrading performance. We introduce AgenTRIM, a framework for detecting and mitigating tool-driven agency risks without altering an agent's internal reasoning. AgenTRIM addresses these risks through complementary offline and online phases. Offline, AgenTRIM reconstructs and verifies the agent's tool interface from code and execution traces. At runtime, it enforces per-step least-privilege tool access through adaptive filtering and status-aware validation of tool calls. In evaluations on the AgentDojo benchmark, AgenTRIM substantially reduces attack success while maintaining high task performance. Additional experiments show robustness to description-based attacks and effective enforcement of explicit safety policies. Together, these results demonstrate that AgenTRIM provides a practical, capability-preserving approach to safer tool use in LLM-based agents.
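The runtime idea of per-step least-privilege tool access can be illustrated with a minimal sketch. The names below (`ToolPolicy`, `filter_tools`, `validate_call`, and the example phases) are illustrative assumptions, not the paper's actual API: an allow-list maps each task phase to the tools permitted at that step, so the agent only sees, and can only invoke, tools it currently needs.

```python
from dataclasses import dataclass, field

@dataclass
class ToolPolicy:
    """Hypothetical per-step allow-list: maps a task phase to permitted tool names."""
    allowed: dict[str, set[str]] = field(default_factory=dict)

    def filter_tools(self, phase: str, tools: list[str]) -> list[str]:
        # Adaptive filtering: expose only the tools permitted in this phase.
        permitted = self.allowed.get(phase, set())
        return [t for t in tools if t in permitted]

    def validate_call(self, phase: str, tool_name: str) -> bool:
        # Status-aware validation: reject calls outside the current phase's allow-list.
        return tool_name in self.allowed.get(phase, set())


# Example policy for a two-phase email task (phases and tool names are invented).
policy = ToolPolicy(allowed={
    "read_inbox_phase": {"read_inbox"},
    "reply_phase": {"send_email"},
})
```

An injected instruction that tries to call `send_email` during the read phase would fail `validate_call`, even though the agent's overall interface includes that tool.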