When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe installed skill sets. We present SkillReact, a compositional security measurement framework with three components: a deterministic static-composition benchmark, a two-rater LLM-assisted human-adjudication pipeline, and an action-based exploitability harness. On 1,520 ClawHub skills, 651 pass individual inspection and form 211,575 pairs; the benchmark flags 22.25% of these as structural candidates. We treat this raw rate as a recall-oriented scanner ceiling and calibrate it against human judgment: in a pattern-stratified audit, roughly one in five flagged pair-pattern hits survives as a real compositional risk (population-weighted validity 18.2%, our headline result), implying about 14K genuine risk memberships in a single registry that per-skill scanning misses by construction, since every pair is individually safe. An action-based harness then probes when these candidates become model-issued tool calls, and finds realization gated by host-model disposition: on an anchor-conditioned dropper subset, Haiku-4-5 issues the dropper-stage tool call on all 39 direct-prompt trials (36 of them the full download-then-execute chain, 3 download-only), Opus-4-7 stops at the download, and Sonnet-4-6 refuses outright. A control that holds the request fixed and varies only the installed skills finds compliance highest with no skills installed: a composition fixes which capabilities are reachable, while the host model decides whether to use them. Together these motivate install-time compositional checks and capability isolation as complements to per-skill scanning.
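The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative and is not part of SkillReact: the constant and function names are hypothetical, and the last figure is a back-of-envelope estimate that assumes the "about 14K" risk memberships are exactly 18.2% of all flagged pair-pattern hits, a total the abstract does not state.

```python
from math import comb

# Figures quoted in the abstract.
SAFE_SKILLS = 651              # skills that pass individual inspection
FLAG_RATE = 0.2225             # share of pairs flagged as structural candidates
VALIDITY = 0.182               # population-weighted validity of flagged hits
GENUINE_MEMBERSHIPS = 14_000   # "about 14K", already rounded in the abstract


def unordered_pairs(n: int) -> int:
    """Number of unordered pairs that can be drawn from n skills."""
    return comb(n, 2)


pairs = unordered_pairs(SAFE_SKILLS)

# Flagged pairs implied by the 22.25% rate (rounded; the abstract gives only the rate).
flagged_pairs = round(pairs * FLAG_RATE)

# ASSUMPTION: treats 14K as exactly 18.2% of all flagged pair-pattern hits.
# A pair can match more than one pattern, so hits may exceed flagged pairs.
implied_hits = round(GENUINE_MEMBERSHIPS / VALIDITY)

print(f"pairs among safe skills:      {pairs:,}")
print(f"flagged pairs (approx.):      {flagged_pairs:,}")
print(f"implied pattern hits (rough): {implied_hits:,}")
```

The first figure reproduces the 211,575 pairs reported above. The other two are approximations derived from rounded percentages and should be read as order-of-magnitude checks only.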
