We study the sample complexity of learning in average-reward weakly coupled Markov decision processes (WCMDPs) and restless bandits (RBs) under a generative model. A naive reduction to a tabular MDP yields prohibitive complexity bounds, because the joint state-action space is exponentially large in the number of arms $N$. By exploiting the weakly coupled structure, we show that near-optimal policies can be learned with sample and computational complexities that are polynomial in $N$. Specifically, we analyze the plug-in approach, which applies an efficient planning algorithm to an empirical model estimated from data. For fully heterogeneous WCMDPs, we establish the first finite-sample PAC guarantee with polynomial complexity and an $O(1/\sqrt{N})$ optimality gap. For homogeneous RBs, we further prove that a smaller optimality gap is achievable under mild structural assumptions. Our main technical contribution is a novel Lyapunov-based analysis framework. Unlike classical approaches that rely on the difficult-to-control bias function, our framework uses an explicitly constructed Lyapunov function together with a drift transfer technique between the true and empirical models. A key step of independent interest is a fine-grained perturbation analysis of the underlying linear programming (LP) relaxation, which provides a general tool for analyzing LP-based policies and weakly coupled systems.
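To make the estimation half of the plug-in approach concrete, the following is a minimal sketch of building an empirical model from a generative model. It assumes each arm has a small finite state and action space and that the generative model can be queried per arm; the names `estimate_arm_kernels` and `sample_next_state`, the toy sizes, and the per-triple sample count are illustrative assumptions and do not come from the paper. The planning step (for example, solving an LP relaxation on the empirical model) is not shown.

```python
import numpy as np


def estimate_arm_kernels(sample_next_state, num_arms, num_states, num_actions, n_samples):
    """Estimate per-arm transition kernels by counting generative-model samples.

    sample_next_state(i, s, a) is a hypothetical oracle that returns one next
    state of arm i drawn from its true kernel at (state s, action a).
    """
    counts = np.zeros((num_arms, num_states, num_actions, num_states))
    for i in range(num_arms):
        for s in range(num_states):
            for a in range(num_actions):
                for _ in range(n_samples):
                    s_next = sample_next_state(i, s, a)
                    counts[i, s, a, s_next] += 1.0
    # Normalizing the counts gives the empirical transition probabilities.
    return counts / n_samples


# Toy instance (all sizes are assumptions chosen for illustration).
rng = np.random.default_rng(0)
N, S, A, n = 5, 3, 2, 200
P_true = rng.dirichlet(np.ones(S), size=(N, S, A))  # shape (N, S, A, S)


def oracle(i, s, a):
    return rng.choice(S, p=P_true[i, s, a])


P_hat = estimate_arm_kernels(oracle, N, S, A, n)

# Estimating arm by arm uses N * S * A * n queries, which is linear in N,
# whereas the joint state space of the equivalent tabular MDP has S**N states.
total_samples = N * S * A * n
joint_states = S ** N
```

In this sketch the number of oracle calls grows linearly with the number of arms, which illustrates why working with the per-arm structure avoids the exponential blow-up of the joint model; the guarantees stated in the abstract concern the full procedure, not this toy estimator.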