BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse: a third-party skill may appear benign while concealing malicious behavior inside its bundled model. We present BadSkill, a backdoor attack formulation that targets this model-in-skill threat surface. In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload only when routine skill parameters satisfy attacker-chosen semantic trigger combinations. To realize this attack, we train the embedded classifier with a composite objective that combines classification loss, margin-based separation, and poison-focused optimization, and evaluate it in an OpenClaw-inspired simulation environment that preserves third-party skill installation and execution while enabling controlled multi-model study. Our benchmark spans 13 skills, including 8 triggered tasks and 5 non-trigger control skills, with a combined main evaluation set of 571 negative-class queries and 396 trigger-aligned queries. Across eight architectures (494M to 7.1B parameters) from five model families, BadSkill achieves up to 99.5% average attack success rate (ASR) across the eight triggered skills while maintaining strong benign-side accuracy on negative-class queries. In poison-rate sweeps on the standard test split, a 3% poison rate already yields 91.7% ASR. The attack remains effective across the evaluated model scales and under five text perturbation types. These findings identify model-bearing skills as a distinct model supply-chain risk in agent ecosystems and motivate stronger provenance verification and behavioral vetting for third-party skill artifacts.
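
The two headline metrics above, ASR on trigger-aligned queries and benign-side accuracy on negative-class queries, can be illustrated with a minimal sketch. The abstract does not give the exact metric definitions, so everything here is an assumption made for illustration: the label names TRIGGER and BENIGN, the helper functions, the toy predictions, and the choice of an unweighted per-skill mean for the "average ASR".

```python
from typing import Dict, Sequence

# Hypothetical label convention (not from the paper):
TRIGGER = 1  # embedded classifier activates the hidden payload
BENIGN = 0   # embedded classifier follows the normal skill path


def rate(predictions: Sequence[int], expected: int) -> float:
    """Fraction of predictions that equal the expected label."""
    if not predictions:
        raise ValueError("need at least one prediction")
    return sum(p == expected for p in predictions) / len(predictions)


def evaluate(trigger_preds: Sequence[int],
             negative_preds: Sequence[int]) -> Dict[str, float]:
    """ASR over trigger-aligned queries, accuracy over negative-class queries."""
    return {
        "asr": rate(trigger_preds, TRIGGER),
        "benign_accuracy": rate(negative_preds, BENIGN),
    }


def average_asr(per_skill_trigger_preds: Dict[str, Sequence[int]]) -> float:
    """Unweighted mean of per-skill ASR (assumed averaging scheme)."""
    scores = [rate(p, TRIGGER) for p in per_skill_trigger_preds.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy predictions, not data from the paper.
    print(evaluate([1, 1, 1, 0], [0, 0, 0, 0, 1]))
    print(average_asr({"skill_a": [1, 1], "skill_b": [1, 0]}))
```

Under this reading, a stealthy backdoor is one that pushes "asr" toward 1 while keeping "benign_accuracy" close to that of a clean model, which is why the abstract reports both quantities together.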
