We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization converges, for any initialization, to the same joint policy, and that this joint policy is equivariant with respect to all symmetries of the Dec-POMDP. In particular, policies obtained from different initializations are fully compatible, in that their cross-play returns equal their self-play returns. Through extensive evaluation of independent PPO, arguably the standard deep multi-agent policy gradient baseline, in the Hanabi, Overcooked and Yokai environments, we find that the entropy coefficient has a massive influence on the cross-play returns between independently trained policies, and that the decrease in self-play returns caused by increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi in particular, we achieve a new state of the art in inter-seed cross-play this way. While we give examples of Dec-POMDPs in which one cannot learn the optimal symmetry-equivariant policy this way, both our theoretical and empirical results suggest that one should consider far higher entropy coefficients during hyperparameter sweeps in Dec-POMDPs than is typically done. Code for our experiments can be found at https://github.com/jforkel/JAX-OBL.
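
To make the convergence claim concrete, the following is a minimal sketch on a hypothetical toy problem, not the authors' code and not taken from the linked repository. It assumes a one-shot, two-agent, two-action coordination game (a degenerate Dec-POMDP with a single step and no observations) whose only nontrivial symmetry swaps the two action labels. The names `REWARD`, `train`, `self_and_cross_play`, the step size, and the two values of the entropy coefficient `tau` are all illustrative assumptions. The gradient flow is approximated by plain Euler steps on the softmax logits.

```python
import math

# Hypothetical toy problem: a one-shot, two-agent coordination game.
# Both agents receive reward 1 if their actions match and 0 otherwise.
# Swapping the two action labels is a symmetry of this game.
REWARD = [[1.0, 0.0], [0.0, 1.0]]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_return(pi1, pi2):
    """Unregularized expected return when agent 1 plays pi1 and agent 2 plays pi2."""
    return sum(pi1[a] * pi2[b] * REWARD[a][b] for a in range(2) for b in range(2))


def regularized_gradient(pi_self, action_values, tau):
    """Gradient of E[q] + tau * entropy with respect to one agent's softmax logits."""
    # With f(a) = q(a) - tau * log pi(a), the logit gradient is pi(a) * (f(a) - E[f]).
    f = [q - tau * math.log(p) for q, p in zip(action_values, pi_self)]
    mean_f = sum(p * fa for p, fa in zip(pi_self, f))
    return [p * (fa - mean_f) for p, fa in zip(pi_self, f)]


def train(init1, init2, tau, lr=0.5, steps=2000):
    """Euler discretization of the joint entropy-regularized policy gradient flow."""
    theta1, theta2 = list(init1), list(init2)
    for _ in range(steps):
        pi1, pi2 = softmax(theta1), softmax(theta2)
        # Action values of each agent against the other agent's current policy.
        q1 = [sum(pi2[b] * REWARD[a][b] for b in range(2)) for a in range(2)]
        q2 = [sum(pi1[a] * REWARD[a][b] for a in range(2)) for b in range(2)]
        g1 = regularized_gradient(pi1, q1, tau)
        g2 = regularized_gradient(pi2, q2, tau)
        theta1 = [t + lr * g for t, g in zip(theta1, g1)]
        theta2 = [t + lr * g for t, g in zip(theta2, g2)]
    return softmax(theta1), softmax(theta2)


def self_and_cross_play(tau):
    """Average self-play and cross-play returns over two differently initialized runs."""
    # Two "seeds": one initialized toward action 0, the other toward action 1.
    a1, a2 = train([2.0, 0.0], [1.5, 0.0], tau)
    b1, b2 = train([0.0, 2.0], [0.0, 1.5], tau)
    self_play = 0.5 * (expected_return(a1, a2) + expected_return(b1, b2))
    cross_play = 0.5 * (expected_return(a1, b2) + expected_return(b1, a2))
    return self_play, cross_play


if __name__ == "__main__":
    for tau in (0.01, 1.0):
        sp, xp = self_and_cross_play(tau)
        print(f"tau={tau}: self-play={sp:.3f}, cross-play={xp:.3f}")
```

In this sketch, the small coefficient lets each run settle on its own convention, so self-play is close to 1 while cross-play is close to 0. The large coefficient drives both runs to the same uniform policy, which is invariant under the action swap, so self-play and cross-play coincide at 0.5. The toy game is only meant to illustrate the mechanism; it says nothing about the coefficient values that are appropriate in Hanabi, Overcooked or Yokai.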