Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2~percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for \emph{when} Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (591, 12865, 17253, and 36001 tokens) and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the No-Skills and Comprehensive-Skills conditions is only 8.9~pp and is not statistically significant ($\chi^2$ test, $p = 0.71$; Cochran--Armitage trend test, $p = 0.25$), and five of six pairwise Cohen's $h$ values fall below the $0.2$ small-effect threshold. We argue that the missing variable is \emph{environment-feedback bandwidth}. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and in some cases (e.g., our timing side-channel setting) Skills actively degrade performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.
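
For reference, and assuming the standard definition of Cohen's $h$ (the abstract does not restate it), the effect size between two pass rates can be written as below. The symbols $p_1$ and $p_2$ are our notation for the pass rates of any two documentation conditions being compared; they do not appear in the original text.
\[
  h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}, \qquad |h| < 0.2 \ \text{(below the conventional small-effect threshold)}.
\]
With four conditions there are $\binom{4}{2} = 6$ such pairwise comparisons, which matches the six Cohen's $h$ values referred to above.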