MalSkillBench: A Runtime-Verified Benchmark of Malicious Agent Skills

AI coding agents such as Claude Code and Gemini CLI increasingly extend themselves with third-party skills: markdown packages bundling natural-language instructions, executable scripts, and tool permissions. Because a skill is at once code and agent-facing instruction, it introduces a supply-chain dependency whose risk is neither pure code nor pure prompt. Detection tools have never been measured against verified ground truth spanning this hybrid space, so their effectiveness is unknown and evaluations that rely only on in-the-wild samples are biased. We present MalSkillBench, the first runtime-verified benchmark of malicious agent skills: 3,944 malicious skills labeled along a three-dimensional taxonomy of 108 cells. Of these, 3,214 come from a closed-loop Generate-Verify-Feedback pipeline that admits only samples whose malicious behavior fires inside a Docker sandbox under system-call monitoring and an LLM judge; we add 703 in-the-wild and 4,000 matched benign skills. Our measurements yield four consistent findings. First, code injection reaches 94.5% verification yield but prompt injection only 75.8%, the same fragility that later makes it hard to detect. Second, the wild sample is narrow: it is dominated by one cryptocurrency-theft campaign (86.6% one behavior, 81% from two accounts), with a small but architecturally new tail attacking the agent control plane. Third, the strongest skill-specific detector reaches 98.4% recall on code injection yet collapses on prompt-injection and agent-control attacks, and wild-only scoring swings the ranking by up to 66 recall points. Fourth, supply-chain scanners and prompt-injection defenses each see only half of a skill, and no combination recovers the code-instruction relationship. Detecting malicious skills therefore requires reasoning jointly over task intent, code, and instructions. We release the dataset, pipeline, baselines, and results.
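The admission rule of the Generate-Verify-Feedback loop described above can be illustrated with a minimal sketch. This is not the released pipeline: the names `Verdict`, `generate_verify_feedback`, `verification_yield`, and the `max_rounds` limit are hypothetical, the generator and verifier are stand-in stubs rather than an LLM and a Docker sandbox, and the sketch assumes a sample is admitted only when both the system-call monitor and the LLM judge confirm the behavior.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Outcome of one sandboxed run (hypothetical structure)."""
    syscall_hit: bool   # monitor observed the expected malicious behavior
    judge_hit: bool     # LLM judge confirmed the behavior
    feedback: str = ""  # hint passed back to the generator on failure

    @property
    def admitted(self) -> bool:
        # Assumption: both signals must agree before a sample is admitted.
        return self.syscall_hit and self.judge_hit


def generate_verify_feedback(
    generate: Callable[[str, str], str],
    verify: Callable[[str], Verdict],
    cell: str,
    max_rounds: int = 3,  # hypothetical retry budget
) -> Optional[str]:
    """Return a verified skill for one taxonomy cell, or None if none fires."""
    feedback = ""
    for _ in range(max_rounds):
        skill = generate(cell, feedback)
        verdict = verify(skill)
        if verdict.admitted:
            return skill
        feedback = verdict.feedback  # close the loop with the failure reason
    return None


def verification_yield(results: List[Optional[str]]) -> float:
    """Fraction of attempted samples that were admitted."""
    if not results:
        return 0.0
    return sum(r is not None for r in results) / len(results)


# Stand-in stubs for illustration only; a real setup would call an LLM
# generator and run the skill inside an instrumented sandbox.
def fake_generate(cell: str, feedback: str) -> str:
    return f"{cell}:v2" if feedback else f"{cell}:v1"


def fake_verify(skill: str) -> Verdict:
    fired = skill.endswith("v2")
    return Verdict(fired, fired, "" if fired else "payload did not fire")
```

With these stubs, the first attempt fails verification, the feedback string is passed back to the generator, and the second attempt is admitted. Verification yield, as reported per attack type in the abstract, is then the admitted fraction over all attempted samples.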
