Non-stationarity is a fundamental challenge in multi-agent reinforcement learning (MARL), where agents update their behaviour as they learn. Many theoretical advances in MARL avoid the challenge of non-stationarity by coordinating the policy updates of agents in various ways, including synchronizing times at which agents are allowed to revise their policies. Synchronization enables analysis of many MARL algorithms via multi-timescale methods, but such synchrony is infeasible in many decentralized applications. In this paper, we study an asynchronous variant of the decentralized Q-learning algorithm, a recent MARL algorithm for stochastic games. We provide sufficient conditions under which the asynchronous algorithm drives play to equilibrium with high probability. Our solution utilizes constant learning rates in the Q-factor update, which we show to be critical for relaxing the synchrony assumptions of earlier work. Our analysis also applies to asynchronous generalizations of a number of other algorithms from the regret testing tradition, whose performance is analyzed by multi-timescale methods that study Markov chains obtained via policy update dynamics. This work extends the applicability of the decentralized Q-learning algorithm and its relatives to settings in which parameters are selected in an independent manner, and tames non-stationarity without imposing the coordination assumptions of prior work.
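To give some intuition for the role of constant learning rates mentioned in the abstract, the following is a minimal sketch, not the paper's algorithm. It assumes a scalar Q-factor, a hand-made reward stream whose mean shifts partway through (standing in for other agents revising their policies at uncoordinated times), and the standard Q-learning update form. The names `q_update` and `track` are hypothetical and introduced only for illustration.

```python
def q_update(q, reward, next_value, alpha, gamma):
    """One Q-factor update of the standard form with step size alpha."""
    return (1.0 - alpha) * q + alpha * (reward + gamma * next_value)


def track(rewards, step_size, gamma=0.0):
    """Apply q_update over a reward stream; step_size(n) is the n-th rate.

    The continuation value is fixed at 0 here (an assumption made to keep
    the example scalar), so the estimate simply tracks the rewards.
    """
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q = q_update(q, r, 0.0, step_size(n), gamma)
    return q


# Reward stream whose mean shifts from 0 to 1, loosely mimicking a change
# in the environment caused by other agents updating their policies.
rewards = [0.0] * 500 + [1.0] * 50

q_const = track(rewards, lambda n: 0.1)      # constant learning rate
q_decay = track(rewards, lambda n: 1.0 / n)  # decaying rate (running average)

print(f"constant rate: {q_const:.4f}")
print(f"decaying rate: {q_decay:.4f}")
```

With a constant rate, the weight on old samples decays geometrically, so the estimate moves close to the new mean soon after the shift. With the decaying rate `1/n`, the estimate is the running average of all samples and stays dominated by outdated data. This is only one informal reading of why constant rates may suit settings without synchronized policy updates; the paper's own conditions and analysis are not reproduced here.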