When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are reported to improve task pass rates by an average of 16.2~percentage points across diverse domains. Yet the same benchmarks show wide variance: 16 of 84 tasks suffer negative deltas when Skills are introduced. The community has not yet articulated a clear mechanism for \emph{when} Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (55, 1{,}478, 1{,}976, and 4{,}147 lines), and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses: the spread between the No-Skills and Comprehensive-Skills conditions is only 8.9~pp and is not statistically significant ($p = 0.71$, $\chi^2$ test; $p = 0.25$, Cochran--Armitage trend test), and five of six pairwise Cohen's $h$ values fall below the $0.2$ small-effect threshold. We argue that the missing variable is \emph{environment-feedback bandwidth}. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, curated Skills add little and, in some cases (e.g., our timing side-channel setting), actively degrade performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.
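
For readers unfamiliar with the effect-size measure cited above, the following is a minimal sketch of Cohen's $h$ for two pass rates, using the standard arcsine transformation. The function name `cohens_h`, the assumed 45 runs per condition, and the pass counts are all hypothetical illustrations; they are not the study's data and this is not the released reanalysis pipeline.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Hypothetical example: two conditions with an assumed 45 runs each.
# These pass counts are made up for illustration only.
runs_per_condition = 45
passes_a, passes_b = 34, 30

h = cohens_h(passes_a / runs_per_condition, passes_b / runs_per_condition)
label = "below" if abs(h) < 0.2 else "at or above"
print(f"h = {h:.3f} ({label} the 0.2 small-effect threshold)")
```

By convention, $|h| < 0.2$ is read as smaller than a small effect, which is the threshold referenced in the abstract.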
