Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2~percentage points (pp) across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clear mechanism for \emph{when} Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (55, 1{,}478, 1{,}976, and 4{,}147 lines), and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the no-Skills and full-Skills conditions is only 8.9~pp and is not statistically significant ($\chi^2$ test, $p = 0.71$; Cochran--Armitage trend test, $p = 0.25$), and five of six pairwise Cohen's $h$ values fall below the $0.2$ small-effect threshold. We argue that the missing variable is \emph{environment-feedback bandwidth}. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and in some cases (e.g., our timing side-channel setting) Skills actively degrade performance. We state this as a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.
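
For reference, the effect size quoted above, Cohen's $h$, is conventionally defined for two pass rates as shown below. The symbols $p_1$ and $p_2$ are illustrative notation introduced here, not taken from the underlying study:
\[
  h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2},
\]
with $|h| < 0.2$ usually read as falling below the small-effect threshold. Because of the arcsine transform, a fixed gap in percentage points corresponds to a different $h$ depending on the baseline pass rates, so the 8.9~pp spread and the pairwise $h$ values carry complementary information.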