Learning to Contest: Decentralized Robust Fairness in Cooperative MARL via Cross-Attention

Fair cooperative multi-agent RL (MARL) teams that maximize egalitarian welfare are exploitable: a single selfish agent can free-ride on the surplus that fair agents forgo to raise the worst-off. A centralized need-based allocator removes this exploit, but only by taking allocation out of the agents' hands; whether decentralized policies can be robust was left open. We show that the apparent futility of decentralized resistance is an artifact of all-or-nothing contention. Under graded contention (a contested resource delivers $1-c$, wasting $c$), we prove that for any $c<1$ a worst-off cooperator that contests a free-rider does strictly better than one that yields, so decentralized leverage exists (Prop. 1). Realizing this leverage is a coordination problem under uncertainty: the number of free-riders is unknown and variable, so any fixed rule is dominated. We introduce CAN, a permutation-equivariant cross-attention policy over agents' observed behaviour that infers the number of free-riders and responds proportionally: turn-taking when there are none, contesting just enough when there are some. Trained against an adversarial league (PSRO), CAN keeps best-response exploitability low ($\rho\approx1.2$–$1.5$, vs. $\rho=N$ unprotected) across the contention range, wasting almost nothing at $D=0$ (efficiency $\approx1.0$) and retaining most of its efficiency at $D\geq1$ (0.83–0.96). It approaches the centralized oracle on both axes without a central allocator. Fair-MARL learners fail on complementary axes (GGF/FEN yield and are exploitable; SOTO contests indiscriminately and wastes resources), while CAN is both robust and efficient. On two further games we find clear scope, not blanket generality: CAN stays efficient and Pareto-dominates the fair learners, but its robustness holds only in proportion to the contest leverage: strong on a multi-server game, partial when the leverage weakens, and absent under winner-take-all (where Prop. 1 fails). We also report its fragilities: weak leverage and zero-shot transfer to larger teams both degrade it at high contention.
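
As a rough illustration of the leverage behind Prop. 1, consider a toy single-resource case. The equal split between the two contestants and the payoff symbols $u_{\text{yield}}$ and $u_{\text{contest}}$ are assumptions introduced here for illustration, not the paper's model; only the delivered amount $1-c$ comes from the text above.

```latex
% Toy assumption (not from the paper): one free-rider always claims the
% resource; if the worst-off cooperator also claims it, the delivered
% amount 1-c is split equally between the two.
\begin{align*}
u_{\text{yield}}   &= 0, \\
u_{\text{contest}} &= \frac{1-c}{2}, \\
u_{\text{contest}} - u_{\text{yield}} &= \frac{1-c}{2} > 0
  \quad \text{for all } c < 1 .
\end{align*}
```

Under this toy split the gain shrinks to zero as $c \to 1$, that is, when a contested resource delivers nothing, which is consistent with leverage disappearing under all-or-nothing contention.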
