Gradient-based learning in multi-agent systems is difficult because the gradient derives from a first-order model which does not account for the interaction between agents' learning processes. LOLA (arXiv:1709.04326) accounts for this by differentiating through one step of optimization. We extend the ideas of LOLA and develop a fully general value-based approach to optimization. At the core is a function we call the meta-value, which at each point in joint-policy space gives for each agent a discounted sum of its objective over future optimization steps. We argue that the gradient of the meta-value gives a more reliable improvement direction than the gradient of the original objective, because the meta-value derives from empirical observations of the effects of optimization. We show how the meta-value can be approximated by training a neural network to minimize temporal-difference (TD) error along optimization trajectories in which agents follow the gradient of the meta-value. We analyze the behavior of our method on the Logistic Game and on the Iterated Prisoner's Dilemma.
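To make the notion of a discounted sum over optimization steps concrete, the following is a minimal sketch and not the method itself. It rests on several assumptions that do not come from the text above: the two scalar-parameter objectives in `objectives` are hypothetical, the agents take naive simultaneous gradient-ascent steps on their own objectives (rather than following the gradient of the meta-value), and a finite-horizon rollout (`truncated_meta_value`) stands in for the neural-network approximation. The function names and the values of `gamma`, `alpha` and `horizon` are likewise illustrative.

```python
def objectives(x, y):
    """Hypothetical two-player objectives, chosen only for illustration."""
    f1 = -(x - 1.0) ** 2 + 0.5 * x * y
    f2 = -(y + 1.0) ** 2 - 0.5 * x * y
    return f1, f2


def naive_step(x, y, alpha):
    """One simultaneous gradient-ascent step, each agent on its own objective.

    Assumption: naive gradients are swapped in here for simplicity; the
    abstract describes agents following the gradient of the meta-value.
    """
    g1 = -2.0 * (x - 1.0) + 0.5 * y  # d f1 / d x
    g2 = -2.0 * (y + 1.0) - 0.5 * x  # d f2 / d y
    return x + alpha * g1, y + alpha * g2


def truncated_meta_value(x, y, gamma=0.9, alpha=0.1, horizon=50):
    """Discounted sum of each agent's objective over `horizon` optimization steps."""
    v1 = v2 = 0.0
    discount = 1.0
    for _ in range(horizon):
        f1, f2 = objectives(x, y)
        v1 += discount * f1
        v2 += discount * f2
        x, y = naive_step(x, y, alpha)
        discount *= gamma
    return v1, v2


def td_error(x, y, gamma=0.9, alpha=0.1, horizon=50):
    """TD residual f(x) + gamma * V(x') - V(x) for each agent.

    The successor value uses horizon - 1 so that the truncated sums line up
    term by term; a learned approximator would instead be trained to push
    this residual toward zero.
    """
    f1, f2 = objectives(x, y)
    nx, ny = naive_step(x, y, alpha)
    v1, v2 = truncated_meta_value(x, y, gamma, alpha, horizon)
    n1, n2 = truncated_meta_value(nx, ny, gamma, alpha, horizon - 1)
    return f1 + gamma * n1 - v1, f2 + gamma * n2 - v2
```

Because the rollout computes the discounted sum exactly, its TD residual vanishes up to floating-point error at any starting point. This is the consistency condition that a learned meta-value would be trained to satisfy along optimization trajectories.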