Independent learning (IL), despite being a popular approach in practice for achieving scalability in large-scale multi-agent systems, usually lacks global convergence guarantees. In this paper, we study two representative algorithms, independent $Q$-learning and independent natural actor-critic, within the value-based and policy-based frameworks, respectively, and provide the first finite-sample analysis for approximate global convergence. Our results imply a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$, up to an error term that captures the dependence among agents and characterizes the fundamental limit of IL in achieving global convergence. To establish this result, we develop a novel approach for analyzing IL: we construct a separable Markov decision process (MDP) for the convergence analysis and then bound the gap due to the model difference between the separable MDP and the original one. Moreover, we conduct numerical experiments on a synthetic MDP and an electric vehicle charging example to verify our theoretical findings and to demonstrate the practical applicability of IL.
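To make the IL protocol concrete, below is a minimal, illustrative sketch of tabular independent $Q$-learning, not the paper's exact algorithm or analysis setting: each agent maintains its own $Q$-table over its local state and action and performs a standard single-agent $Q$-learning update, implicitly treating the other agents as part of the environment. The `env` interface (`reset`/`step` returning per-agent local states and rewards) is a hypothetical placeholder.

```python
import numpy as np

def independent_q_learning(env, n_agents, n_states, n_actions,
                           episodes=1000, alpha=0.1, gamma=0.95, eps=0.1):
    """Illustrative tabular independent Q-learning sketch.

    Each agent i keeps its own Q-table Q[i] over its LOCAL state and action,
    and updates it as if it faced a single-agent MDP; the influence of the
    other agents is absorbed into the (nonstationary) environment.
    `env` is a hypothetical interface, assumed here for illustration.
    """
    rng = np.random.default_rng(0)
    Q = [np.zeros((n_states, n_actions)) for _ in range(n_agents)]
    for _ in range(episodes):
        states = env.reset()  # assumed: list of local states, one per agent
        done = False
        while not done:
            # Each agent acts epsilon-greedily on its own Q-table only.
            actions = [
                int(rng.integers(n_actions)) if rng.random() < eps
                else int(np.argmax(Q[i][states[i]]))
                for i in range(n_agents)
            ]
            # assumed: joint transition driven by all agents' actions
            next_states, rewards, done = env.step(actions)
            for i in range(n_agents):
                # Standard Q-learning update using only agent i's local view.
                td_target = rewards[i] + gamma * np.max(Q[i][next_states[i]])
                Q[i][states[i], actions[i]] += alpha * (
                    td_target - Q[i][states[i], actions[i]]
                )
            states = next_states
    return Q
```

Because each update ignores the other agents, the per-agent learning problem is nonstationary; this is precisely the dependence among agents that the error term in the $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample-complexity bound captures.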