We prove that in any Dec-POMDP, sufficiently strong entropy regularization ensures that policy gradient ascent with tabular softmax parametrization converges, from any initialization, to the same joint policy, and that this joint policy is equivariant with respect to all symmetries of the Dec-POMDP. In particular, policies trained from different random seeds are fully compatible, in that their cross-play returns equal their self-play returns. Through extensive empirical evaluation of independent PPO in the Hanabi, Overcooked, and Yokai environments, we find that the entropy coefficient has a strong influence on the cross-play returns between independently trained policies, and that the drop in self-play returns caused by stronger entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi, this recipe achieves a new state of the art in inter-seed cross-play. Despite the clear limitations of this recipe, which we point out, both our theoretical and empirical results indicate that hyperparameter sweeps in Dec-POMDPs should consider far higher entropy coefficients than is typical.
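To make the recipe concrete, the following is a minimal sketch, not the paper's code or experimental setup: exact entropy-regularized policy gradient ascent with tabular softmax parametrization on a toy two-agent common-payoff 2x2 matrix game (a one-step Dec-POMDP). The payoff matrix, learning rate, step count, and entropy coefficient are illustrative assumptions. It illustrates three of the claims above on a small scale: all seeds converge to the same joint policy, cross-play returns then match self-play returns, and greedifying the learned stochastic policies afterwards recovers a good deterministic joint policy.

```python
import numpy as np

# Common-payoff 2x2 coordination game (illustrative values): joint action
# (0, 0) pays 1.0, (1, 1) pays 0.8, mismatched actions pay 0.
R = np.array([[1.0, 0.0],
              [0.0, 0.8]])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train(tau, seed, steps=5000, lr=0.5):
    """Gradient ascent on J = p1^T R p2 + tau * (H(p1) + H(p2))
    with tabular softmax policies p_i = softmax(theta_i)."""
    rng = np.random.default_rng(seed)
    th1, th2 = rng.normal(size=2), rng.normal(size=2)  # logits per agent
    for _ in range(steps):
        p1, p2 = softmax(th1), softmax(th2)
        # dJ/dp_i: per-action values plus the entropy gradient -tau*(log p + 1)
        g1 = R @ p2 - tau * (np.log(p1) + 1.0)
        g2 = R.T @ p1 - tau * (np.log(p2) + 1.0)
        # chain rule through softmax: dJ/dtheta = p * (g - <p, g>)
        th1 += lr * p1 * (g1 - p1 @ g1)
        th2 += lr * p2 * (g2 - p2 @ g2)
    return softmax(th1), softmax(th2)

# With a high entropy coefficient, every seed converges to the same joint
# policy, so cross-play returns match self-play returns.
pols = [train(tau=1.0, seed=s) for s in range(3)]
for s, (p1, p2) in enumerate(pols):
    print(f"seed {s}: p1={p1.round(3)} p2={p2.round(3)} self-play={p1 @ R @ p2:.3f}")
print(f"cross-play (p1 of seed 0 vs p2 of seed 1): {pols[0][0] @ R @ pols[1][1]:.3f}")

# Greedifying the learned stochastic policies recovers the best deterministic
# joint policy, undoing the self-play return drop from the entropy bonus.
a1, a2 = (int(np.argmax(p)) for p in pols[0])
print(f"greedified joint action: ({a1}, {a2}), return: {R[a1, a2]}")
```

In this toy example, every seed converges to the same slightly smoothed joint policy favoring action 0, pairs of policies from different seeds achieve the same return as matched pairs, and the post-training argmax step restores the full return of 1.0. The sketch deliberately uses the exact regularized gradient; the paper's experiments instead use independent PPO, where the entropy coefficient plays the analogous role.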