AI agents are increasingly deployed to act autonomously in the world, yet there is still no reliable way to trace a harmful agent back to the account that deployed it. This creates the same accountability gap across both ends of the intent spectrum: benign operators may deploy misconfigured or overbroad agents that cause harm unintentionally, while malicious operators may deliberately weaponize agents for scams, harassment, or cyber attacks. In many cases, these agents are powered by vendor-hosted models, a dependency that holds even for sophisticated adversaries such as state actors conducting cyber operations. In either case, affected parties can observe the behavior but cannot notify the responsible operator, stop the session, or identify the account for investigation. We formalize this gap as the problem of agent attribution: linking an observed agent interaction to the responsible account at the hosting vendor. To our knowledge, this is the first work to define the problem and present a practical solution. Our protocol is canary-based: an authorized party injects a canary into the agent's interaction stream, and the vendor searches a narrow window of session logs to recover the originating session and account. Simple canaries suffice in non-adversarial settings. For adversarial operators who filter or paraphrase incoming content, we develop robust canary constructions that cannot be suppressed without degrading the agent's own task performance, yielding a formal asymmetry in the defender's favor. We evaluate a variety of scenarios including real-world agents and show that our attribution method is reliable, robust, and scalable for vendor-side deployment.
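As an illustration of the canary-based protocol in the simple, non-adversarial case, the following minimal sketch shows an authorized party generating a canary and a vendor searching a narrow time window of session logs for it. This is not the paper's implementation: the `LogEntry` schema, the `make_canary` and `attribute` helpers, the literal substring match, and the five-minute window are all assumptions made for illustration. The robust canary constructions for adversarial operators who filter or paraphrase content are not covered here.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEntry:
    """Hypothetical vendor-side log record (schema assumed for illustration)."""
    session_id: str
    account_id: str
    timestamp: datetime
    text: str  # content that reached the hosted model in this session


def make_canary(n_bytes: int = 16) -> str:
    """Return a random, high-entropy token that is unlikely to occur by chance."""
    return "canary-" + secrets.token_hex(n_bytes)


def attribute(canary, injected_at, logs, window=timedelta(minutes=5)):
    """Return the sorted (session_id, account_id) pairs whose logs contain
    the canary within `window` after the injection time."""
    start, end = injected_at, injected_at + window
    return sorted({
        (entry.session_id, entry.account_id)
        for entry in logs
        if start <= entry.timestamp <= end and canary in entry.text
    })
```

In this sketch, the affected party would place the output of `make_canary()` where the agent will read it (for example, in a reply or on a page the agent visits) and report the token and injection time to the vendor, who would then call `attribute` over the logs for that window. A literal match like this would be defeated by an operator who paraphrases incoming content, which is the case the robust constructions are meant to address.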