When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, have been reported to improve task pass rates by an average of 16.2~percentage points (pp) across diverse domains. Yet the same benchmarks show wide variance: 16 of 84 tasks suffer negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for \emph{when} Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (591, 12865, 17253, and 36001 tokens) and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses: the spread between the no-Skills and full-Skills conditions is only 8.9~pp and is not statistically significant ($\chi^2$ test, $p = 0.71$; Cochran--Armitage trend test, $p = 0.25$), and five of six pairwise Cohen's $h$ values fall below the $0.2$ small-effect threshold. We argue that the missing variable is \emph{environment-feedback bandwidth}. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and in some cases (e.g., our timing side-channel setting) Skills actively degrade performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.
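
For readers unfamiliar with the effect-size measure cited above, the following is a minimal sketch of how pairwise Cohen's $h$ values can be computed for four conditions, using the standard arcsine-transform definition $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$. The pass rates in `pass_rates` are hypothetical placeholders chosen only for illustration; they are not the study's data, and this is not the released reanalysis pipeline.

```python
import math
from itertools import combinations


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions, via the arcsine transform."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# HYPOTHETICAL pass rates for illustration only (not the study's results).
pass_rates = {
    "no_skills": 0.50,
    "experiential_skills": 0.55,
    "curated_skills": 0.45,
    "comprehensive_skills": 0.60,
}

# Four conditions give six unordered pairs.
pairwise = {
    (a, b): cohens_h(pass_rates[a], pass_rates[b])
    for a, b in combinations(pass_rates, 2)
}

# Pairs whose |h| falls below the conventional 0.2 small-effect threshold.
below_small = [pair for pair, h in pairwise.items() if abs(h) < 0.2]

for (a, b), h in pairwise.items():
    print(f"{a} vs {b}: h = {h:+.3f}")
print(f"{len(below_small)} of {len(pairwise)} pairs have |h| < 0.2")
```

The sign of $h$ depends on the order of the two proportions, so the comparison against the threshold uses the absolute value.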
