SkillMutator: Benchmarking and Defending Language-and-Code Cross-modal Attacks on LLM Agent Skills

Large language model (LLM) agents increasingly extend their capabilities at runtime by loading Agent Skills, which pair natural-language specifications (SKILL.md) with executable scripts and resources. Because a skill's behavior relies on both natural-language instructions and executable code, assessing its safety requires cross-modal reasoning, creating a new language-and-code attack surface. Attackers can present a benign workflow in SKILL.md while embedding implicit directives that steer the agent to exfiltrate sensitive files, even if the scripts appear harmless. This attack surface remains understudied; prior work treats skills merely as prompt-injection vectors or static code artifacts, leaving attacks emerging from cross-modal interactions largely unmeasured. In our evaluation, open-source and commercial skill scanners detect only 2%-8% and 9%-17% of such attacks, respectively. To address this gap, we introduce SkillMutator, the first benchmark for install-time detection of language-and-code cross-modal attacks on Agent Skills. It emulates an adversarial mutation process across 13 attack categories, iteratively refining malicious skills using scanner feedback to make injected behaviors indistinguishable from legitimate workflows. We further propose a four-phase reasoning-trajectory distillation framework to distill frontier-teacher traces into smaller open-weight models. This produces a locally deployable scanner avoiding third-party data exposure and excessive API costs. On the strongest SkillMutator subset (n=76), our distilled model (Qwen2.5-Coder-7B-Instruct) improves detection from 17.1% to 88.2%, surpassing GPT-4o-mini (23.7%) and GPT-5.4-mini (79.0%), and reaching frontier-level GPT-5.4 (86.8%). These results show practical defense against cross-modal attacks is feasible without relying on costly frontier models.

翻译：摘要：大语言模型智能体通过加载"智能体技能"（将自然语言规范文件SKILL.md与可执行脚本及资源配对）在运行时不断增强其能力。由于技能的行为同时依赖于自然语言指令与可执行代码，评估其安全性需要跨模态推理，由此催生了新型语言-代码攻击面。攻击者可在SKILL.md中呈现良性工作流程，同时嵌入引导智能体窃取敏感文件的隐式指令，即使相关脚本看似无害。该攻击面尚未得到充分研究：现有工作仅将技能视作提示注入向量或静态代码产物，导致跨模态交互衍生的攻击在很大程度上未被量化。在我们的评估中，开源和商用技能扫描器对该类攻击的检测率分别仅为2%-8%和9%-17%。为弥补这一空白，我们提出SkillMutator——首个针对智能体技能安装时语言-代码跨模态攻击的基准测试框架。该框架模拟涵盖13类攻击场景的对抗性变异过程，通过迭代利用扫描器反馈优化恶意技能，使注入行为与合法工作流难以区分。我们进一步提出四阶段推理轨迹蒸馏框架，将前沿教师模型的追踪轨迹迁移至较小规模的开源权重模型，从而生成可本地部署的扫描器，避免第三方数据暴露与高昂API成本。在SkillMutator最强子集（n=76）上，经蒸馏的Qwen2.5-Coder-7B-Instruct模型将检测率从17.1%提升至88.2%，超越GPT-4o-mini（23.7%）与GPT-5.4-mini（79.0%），达到前沿模型GPT-5.4水平（86.8%）。结果表明，无需依赖昂贵的前沿模型即可实现跨模态攻击的实用防御。