Gradient-based learning in multi-agent systems is difficult because the gradient derives from a first-order model which does not account for the interaction between agents' learning processes. LOLA (arXiv:1709.04326) accounts for this by differentiating through one step of optimization. We propose to judge joint policies by their long-term prospects as measured by the meta-value, a discounted sum over the returns of future optimization iterates. We apply a form of Q-learning to the meta-game of optimization, in a way that avoids the need to explicitly represent the continuous action space of policy updates. The resulting method, MeVa, is consistent and far-sighted, and does not require REINFORCE estimators. We analyze the behavior of our method on a toy game and compare to prior work on repeated matrix games.
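To make the notion of a meta-value concrete, the following is a minimal sketch of a truncated discounted sum over the returns of future optimization iterates. Everything specific in it is an assumption made for illustration and is not taken from the paper: the two-player game in `returns`, the function names, the step size `alpha`, the discount `gamma`, the finite `horizon`, and the choice to start the sum at the current iterate. In particular, the iterates here are generated by naive simultaneous gradient ascent as a stand-in for the optimization process; the sketch does not implement MeVa itself, which learns the meta-value rather than unrolling it.

```python
def returns(x, y):
    """Per-player returns of a hypothetical two-player differentiable game."""
    f1 = -(x - 1.0) ** 2 - x * y  # player 1 controls x
    f2 = -(y + 1.0) ** 2 + x * y  # player 2 controls y
    return f1, f2


def naive_step(x, y, alpha=0.1):
    """One simultaneous gradient-ascent step; each player ignores the other's learning."""
    g1 = -2.0 * (x - 1.0) - y  # d f1 / d x
    g2 = -2.0 * (y + 1.0) + x  # d f2 / d y
    return x + alpha * g1, y + alpha * g2


def meta_value(x, y, gamma=0.9, alpha=0.1, horizon=50):
    """Truncated discounted sum of returns over future optimization iterates."""
    v1 = v2 = 0.0
    discount = 1.0
    for _ in range(horizon):
        f1, f2 = returns(x, y)
        v1 += discount * f1
        v2 += discount * f2
        x, y = naive_step(x, y, alpha)  # advance to the next iterate
        discount *= gamma
    return v1, v2


if __name__ == "__main__":
    # Compare the immediate returns of a joint policy with its longer-term prospects.
    print("returns:   ", returns(0.0, 0.0))
    print("meta-value:", meta_value(0.0, 0.0))
```

Under these assumptions, the truncated sum satisfies a Bellman-style recursion: the value at the current iterate equals its return plus `gamma` times the value at the next iterate, computed with a horizon shorter by one. This recursive structure is what makes a Q-learning-style treatment of the meta-game natural.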