In cooperative multi-agent reinforcement learning (MARL), centralized training with decentralized execution (CTDE) has recently become the customary paradigm due to physical demands. However, its central dilemma is the inconsistency between jointly trained policies and individually optimized actions. In this work, we propose a novel value-based multi-objective learning approach, named Tchebycheff value decomposition optimization (TVDO), to overcome this dilemma. In particular, a nonlinear Tchebycheff aggregation method is designed to transform the MARL task into its multi-objective optimization counterpart by tightly constraining the upper bound of the individual action-value bias. We theoretically prove that TVDO satisfies the necessary and sufficient condition of individual-global-max (IGM) with no extra limitations, which guarantees consistency between the global and individual optimal action-value functions. Empirically, in the climb and penalty games, we verify that TVDO precisely represents the factorization from global to individual values while guaranteeing policy consistency. Furthermore, we evaluate TVDO in challenging StarCraft II micromanagement scenarios, and extensive experiments demonstrate that TVDO achieves more competitive performance than several state-of-the-art MARL methods.
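To make the IGM condition and the Tchebycheff aggregation concrete, the following is a minimal sketch rather than the TVDO implementation. It assumes the payoff matrix commonly used for the two-agent climb game in the MARL literature (this abstract does not state it), treats the payoffs as a joint action-value table, and uses hypothetical helper names (`joint_greedy`, `individual_greedy`, `satisfies_igm`, `tchebycheff`). The `tchebycheff` function is the textbook weighted Tchebycheff scalarization from multi-objective optimization, shown only to illustrate how a max-type aggregation bounds the largest per-objective deviation; the exact objective used by TVDO is not given in the abstract.

```python
import numpy as np

# Assumed climb-game payoffs (rows: agent 1 action, columns: agent 2 action).
# Treated here as the joint action-value table Q_tot for a one-step game.
CLIMB = np.array([
    [11.0, -30.0, 0.0],
    [-30.0, 7.0, 6.0],
    [0.0, 0.0, 5.0],
])


def joint_greedy(q_joint):
    """Joint action that maximizes the global action-value table."""
    flat_index = np.argmax(q_joint)
    return tuple(int(i) for i in np.unravel_index(flat_index, q_joint.shape))


def individual_greedy(q_individual):
    """Each agent's greedy action under its own utility vector."""
    return tuple(int(np.argmax(q)) for q in q_individual)


def satisfies_igm(q_joint, q_individual):
    """IGM check: the joint argmax must equal the tuple of individual argmaxes.

    Assumes each maximum is unique; np.argmax breaks ties by first index.
    """
    return joint_greedy(q_joint) == individual_greedy(q_individual)


def tchebycheff(values, ideal, weights):
    """Textbook weighted Tchebycheff scalarization: max_i w_i * |f_i - z_i|."""
    values = np.asarray(values, dtype=float)
    ideal = np.asarray(ideal, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.max(weights * np.abs(values - ideal)))


# Hypothetical individual utilities, chosen by hand for illustration.
consistent = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])]
inconsistent = [np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0])]

print(joint_greedy(CLIMB))                 # greedy joint action
print(satisfies_igm(CLIMB, consistent))    # individual argmaxes agree with it
print(satisfies_igm(CLIMB, inconsistent))  # individual argmaxes pick (2, 2)
print(tchebycheff([1.0, 4.0], [0.0, 0.0], [1.0, 0.5]))
```

In the second pair of utilities, each agent greedily selects the safe action with payoff 5 while the joint optimum has payoff 11, which is the kind of inconsistency between joint and individual greedy actions that the abstract describes. Minimizing a max-type aggregation pushes down the worst per-agent deviation rather than an average, which is one reading of "tightly constraining the upper bound" of the individual bias.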