When Should Agent Trust Be Conditional? Characterizing and Attacking Skill-Conditional Reputation in Agent Swarms

Open platforms increasingly route tasks among heterogeneous LLM agents that differ in base model, scaffold, and tool stack, and whose competence varies sharply by skill: an agent excellent at one skill may be useless at another. The standard reputation approach summarizes each agent by a single global trust score, but that scalar is the wrong object here, because routing every task to the globally most-trusted agent leaves the value of specialization unclaimed. We study skill-conditional trust R(i | k), the trust to place in agent i for a task requiring skill k, and pose three falsifiable questions: when is conditioning worth it, how much cross-skill evidence should be borrowed, and is that borrowing safe? A controlled phase-diagram analysis answers the first two. Conditional trust wins only in a specific regime (high agent heterogeneity, sparse per-skill evidence, and correlated skills), and the coupling strength beta that buys this data efficiency is dual-use, because the same cross-skill borrowing is also a laundering channel. On a public benchmark of 14 heterogeneous AppWorld agents, real pools land inside the beneficial regime: the gain is small but real, and the best agent changes from skill to skill. We then show that an attacker with cheap evidence in one skill and none in a target skill hijacks the conditional router, driving routing regret from 0 to 0.94 on a pool that our zero-cost Conditional Information Value Test (CIVT) rates GREEN, while the ungated trust verdict it contaminates reads -0.06 instead of the honest +0.19. A zero-evidence gate bounds the attack but does not eliminate it; we characterize the residual cost under an explicit budget. We do not claim Sybil-resistance; we quantify the trade-off.
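
To make the dual-use role of the coupling strength concrete, the following is a minimal sketch and not the estimator used in this work. It assumes a Beta-Bernoulli trust score with a uniform prior, in which evidence from other skills is added with weight beta, and a gate that refuses to borrow when the agent has no evidence in the target skill. The names `conditional_trust`, `route`, and the toy `evidence` pool are hypothetical and exist only for illustration.

```python
from typing import Dict, Tuple

# evidence[agent][skill] = (successes, trials). Hypothetical toy data layout.
Evidence = Dict[str, Dict[str, Tuple[int, int]]]


def conditional_trust(evidence: Evidence, agent: str, skill: str,
                      beta: float, gate: bool = False) -> float:
    """Illustrative R(agent | skill): posterior mean under a uniform prior,
    with evidence from other skills counted at weight beta (an assumption)."""
    own_s, own_n = evidence[agent].get(skill, (0, 0))
    if gate and own_n == 0:
        # Zero-evidence gate: no borrowing without any target-skill evidence.
        return 0.5  # mean of the uniform prior
    other_s = sum(s for k, (s, _) in evidence[agent].items() if k != skill)
    other_n = sum(n for k, (_, n) in evidence[agent].items() if k != skill)
    return (1 + own_s + beta * other_s) / (2 + own_n + beta * other_n)


def route(evidence: Evidence, skill: str, beta: float,
          gate: bool = False) -> str:
    """Send the task to the agent with the highest conditional trust."""
    return max(evidence,
               key=lambda a: conditional_trust(evidence, a, skill, beta, gate))


# Toy pool: the attacker has cheap evidence in skill "A" and none in "B".
evidence: Evidence = {
    "honest": {"A": (2, 4), "B": (3, 4)},
    "attacker": {"A": (50, 50)},
}

ungated_choice = route(evidence, "B", beta=0.5)            # borrowing is laundered
gated_choice = route(evidence, "B", beta=0.5, gate=True)   # gate blocks borrowing
```

In this toy pool the ungated router sends skill "B" tasks to the attacker, and the gated router sends them to the honest agent. The gate only bounds the problem in this sketch: an attacker who acquires a single observation in skill "B" passes the gate and borrows again.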
