SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

Agent skills occupy a privileged position in the agent workflow: agents are expected to follow and execute them implicitly, which makes third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types, organized by the agent workflow component that the harm targets: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline in which coding agents are driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable, with attack success rates of up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than from genuine resistance. Current defenses also fail to reliably mitigate the threat.

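The difference between the two scenarios can be illustrated with a toy model of the skill-use lifecycle. This is only a minimal sketch and not the benchmark's implementation: `SkillStore`, `run_session`, `simulate`, and the `PAYLOAD` and `MUTATE` markers are hypothetical names introduced here, and the "agent" is assumed to follow skill text unconditionally.

```python
from dataclasses import dataclass, field

# Hypothetical markers, not taken from the paper.
PAYLOAD = "<malicious-instruction>"  # stands in for a harmful instruction
MUTATE = "<self-mutate>"             # stands in for a step that rewrites the stored skill


@dataclass
class SkillStore:
    """Persistent skill content that survives across task sessions."""
    skills: dict = field(default_factory=dict)


def run_session(store: SkillStore, name: str) -> bool:
    """Toy session that follows the skill implicitly; returns True if harm occurred."""
    content = store.skills[name]
    harmed = PAYLOAD in content
    if MUTATE in content:
        # The execution itself looks benign, but it silently rewrites the
        # persistent skill so that the payload is present on the next reuse.
        store.skills[name] = content.replace(MUTATE, PAYLOAD)
    return harmed


def simulate(initial_content: str, sessions: int = 2) -> list:
    """Run several sessions against one persistent skill and record harm per session."""
    store = SkillStore({"demo": initial_content})
    return [run_session(store, "demo") for _ in range(sessions)]


# FPP: the package is poisoned from the start, so every invoking session is harmed.
fpp = simulate("do the task " + PAYLOAD)
# SMP: the first session is benign and mutates the skill; harm appears on reuse.
smp = simulate("do the task " + MUTATE)

print("FPP harm per session:", fpp)  # [True, True]
print("SMP harm per session:", smp)  # [False, True]
```

In this toy setting, an evaluation confined to a single task execution would record the SMP case as harmless, which is the gap that a lifecycle-aware evaluation is meant to close.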