The evolution of Large Language Models (LLMs) into agentic systems that perform autonomous reasoning and tool use has created significant intellectual property (IP) value. We demonstrate that these systems are highly vulnerable to imitation attacks, in which adversaries steal proprietary capabilities by training imitation models on victim outputs. Crucially, existing LLM watermarking techniques fail in this domain because real-world agentic systems often operate as grey boxes, concealing the internal reasoning traces required for verification. This paper presents AGENTWM, the first watermarking framework designed specifically for agentic models. AGENTWM exploits the semantic equivalence of action sequences, injecting watermarks by subtly biasing the distribution of functionally identical tool execution paths. This mechanism allows AGENTWM to embed verifiable signals directly into the visible action trajectory while remaining imperceptible to users. We develop an automated pipeline to generate robust watermark schemes and a rigorous statistical hypothesis testing procedure for verification. Extensive evaluations across three complex domains demonstrate that AGENTWM achieves high detection accuracy with negligible impact on agent performance. Our results confirm that AGENTWM effectively protects agentic IP against adaptive adversaries, who cannot remove the watermarks without severely degrading the stolen model's utility.
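To make the verification step concrete, the following is a minimal sketch (not the paper's implementation) of how a statistical hypothesis test over visible action trajectories might look. It assumes a hypothetical format in which each "watermark slot" offers several semantically equivalent tool execution paths, one of which the watermarked model is biased to prefer; the function and variable names are illustrative only.

```python
# Minimal sketch of AGENTWM-style verification via a one-sided binomial test.
# Assumption: each watermark slot exposes n_equiv semantically equivalent
# tool-call variants, and the watermark biases the agent toward one marked variant.
from math import comb


def binom_sf(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0): the chance that an unwatermarked agent
    picks the marked variant at least k times out of n slots purely by chance."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def verify_watermark(slots, alpha: float = 1e-3):
    """slots: list of (chose_marked_variant: bool, n_equiv: int), one entry per
    watermark slot extracted from the suspect model's visible action trajectory
    (hypothetical extraction format). Returns (is_watermarked, p_value)."""
    n = len(slots)
    k = sum(1 for chose, _ in slots if chose)
    # Conservative null hypothesis: uniform choice among equivalents,
    # using the largest per-slot chance of hitting the marked variant.
    p0 = max(1.0 / n_equiv for _, n_equiv in slots)
    p_value = binom_sf(k, n, p0)
    return p_value < alpha, p_value


if __name__ == "__main__":
    # Example: 40 slots, each with 3 equivalent execution paths; the suspect
    # model picked the marked path in 31 of them.
    slots = [(i < 31, 3) for i in range(40)]
    print(verify_watermark(slots))  # -> (True, very small p-value)
```

The one-sided test rejects the null hypothesis (an unwatermarked model choosing uniformly among equivalent paths) only when the marked variants appear far more often than chance allows, which is the property a grey-box verifier can check from the action trajectory alone.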