Offline multi-agent reinforcement learning (MARL) faces a critical challenge: the joint action space grows exponentially with the number of agents, making dataset coverage exponentially sparse and out-of-distribution (OOD) joint actions unavoidable. Partial Action Replacement (PAR) mitigates this by anchoring a subset of agents to dataset actions, but the existing approach relies on enumerating multiple subset configurations, incurring high computational cost, and cannot adapt its choice to the state. We introduce PLCQL, a framework that formulates PAR subset selection as a contextual bandit problem and learns a state-dependent PAR policy using Proximal Policy Optimisation with an uncertainty-weighted reward. This adaptive policy dynamically determines, at each update step, how many agents' actions to replace, balancing policy improvement against conservative value estimation. We prove a value-error bound showing that the estimation error scales linearly with the expected number of deviating agents. Compared with the previous PAR-based method SPaCQL, PLCQL reduces the number of per-iteration Q-function evaluations from n to 1, significantly improving computational efficiency. Empirically, PLCQL achieves the highest normalised scores on 66% of tasks across MPE, MaMuJoCo, and SMAC benchmarks, outperforming SPaCQL on 84% of tasks while substantially reducing computational cost.
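To make the mechanism concrete, the following is a minimal NumPy sketch of partial action replacement with a single Q-evaluation per update and an uncertainty-weighted score, under stated assumptions: the helper names (`partial_action_replacement`, `uncertainty_weighted_reward`), the ensemble-standard-deviation uncertainty proxy, and the fixed subset size `k` are illustrative choices, not PLCQL's implementation, which learns the state-dependent subset-size policy with PPO as a contextual bandit.

```python
import numpy as np

# Illustrative sketch of Partial Action Replacement (PAR): anchor most agents
# to dataset actions, let a chosen subset follow the current policy, and score
# the resulting joint action with a single (uncertainty-penalised) Q-evaluation.

rng = np.random.default_rng(0)
n_agents, act_dim, batch = 4, 2, 8

dataset_actions = rng.normal(size=(batch, n_agents, act_dim))  # behaviour data
policy_actions = rng.normal(size=(batch, n_agents, act_dim))   # current policy


def partial_action_replacement(k, dataset_a, policy_a, rng):
    """Let k randomly chosen agents deviate to current-policy actions;
    the remaining n - k agents stay anchored to dataset actions."""
    batch, n_agents, _ = dataset_a.shape
    mixed = dataset_a.copy()
    for b in range(batch):
        deviating = rng.choice(n_agents, size=k, replace=False)
        mixed[b, deviating] = policy_a[b, deviating]
    return mixed


def uncertainty_weighted_reward(q_ensemble, state, mixed_actions, beta=1.0):
    """Mean critic value minus an epistemic-uncertainty penalty (ensemble std),
    used here as a stand-in bandit reward for choosing the subset size k."""
    values = np.stack([q(state, mixed_actions) for q in q_ensemble])  # (E, batch)
    return values.mean(axis=0) - beta * values.std(axis=0)            # (batch,)


# Toy critic ensemble standing in for learned Q-functions.
q_ensemble = [lambda s, a, w=rng.normal(): w * a.sum(axis=(1, 2)) for _ in range(3)]
state = rng.normal(size=(batch, 6))

k = 2  # in PLCQL this count would be sampled from the learned PAR policy given the state
mixed = partial_action_replacement(k, dataset_actions, policy_actions, rng)
print(uncertainty_weighted_reward(q_ensemble, state, mixed).round(3))
```

Because the mixed joint action is formed once per update, only one Q-function evaluation is needed, in contrast to enumerating one evaluation per candidate subset size.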