We study the sample complexity of learning in average-reward weakly-coupled Markov decision processes (WCMDPs) and restless bandits (RBs) under a generative model. A naive reduction to a tabular MDP leads to prohibitively large complexity bounds, because the state-action space is exponentially large in the number of arms $N$. By exploiting the weakly-coupled structure, we show that near-optimal policies can be learned with sample and computational complexities that are polynomial in $N$. Specifically, we analyze the plug-in approach, which applies an efficient planning algorithm to an empirical model estimated from data. For fully heterogeneous WCMDPs, we establish the first finite-sample PAC guarantee with polynomial complexity and an $O(1/\sqrt{N})$ optimality gap. For homogeneous RBs, we further prove that a smaller optimality gap is achievable under mild structural assumptions. A primary technical contribution of our work is a novel Lyapunov-based analysis framework. Unlike classical approaches that rely on the difficult-to-control bias function, our framework uses an explicitly constructed Lyapunov function together with a drift transfer technique between the true and empirical models. A key step of independent interest in our framework is a fine-grained perturbation analysis of the underlying linear programming (LP) relaxation, which provides a general tool for analyzing LP-based policies and weakly-coupled systems.
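As a rough illustration of the estimation half of the plug-in approach, the sketch below builds a per-arm empirical transition kernel from generative-model samples. It is a minimal sketch under stated assumptions, not the paper's algorithm: the names `estimate_arm_kernel` and `make_sampler`, the toy sizes, and the uniform per-pair sample count are all hypothetical, and the planning step on the empirical model (for example, solving the LP relaxation) is omitted.

```python
import random
from collections import Counter


def estimate_arm_kernel(sample_next_state, n_states, n_actions, n_samples):
    """Return p_hat[s][a][s2], the empirical next-state frequencies for one arm.

    sample_next_state(s, a) stands in for one generative-model query
    (hypothetical interface, assumed for this sketch).
    """
    p_hat = []
    for s in range(n_states):
        row = []
        for a in range(n_actions):
            counts = Counter(sample_next_state(s, a) for _ in range(n_samples))
            row.append([counts[s2] / n_samples for s2 in range(n_states)])
        p_hat.append(row)
    return p_hat


def make_sampler(true_kernel, rng):
    """Wrap a known kernel as a generative model (toy stand-in for real data)."""
    def sample_next_state(s, a):
        weights = true_kernel[s][a]
        return rng.choices(range(len(weights)), weights=weights)[0]
    return sample_next_state


# Toy heterogeneous instance: every arm gets its own randomly drawn kernel.
rng = random.Random(0)
N, S, A, n = 5, 2, 2, 200  # arms, states per arm, actions per arm, samples per (s, a)

true_kernels = []
for _ in range(N):
    kernel = []
    for s in range(S):
        per_action = []
        for a in range(A):
            p = rng.random()
            per_action.append([p, 1.0 - p])
        kernel.append(per_action)
    true_kernels.append(kernel)

# Arms are estimated one at a time, so the query count is linear in N.
empirical = [
    estimate_arm_kernel(make_sampler(k, rng), S, A, n) for k in true_kernels
]
total_queries = N * S * A * n

# Joint state-action pairs of the naive tabular reduction,
# ignoring any coupling constraint on the actions.
joint_pairs = (S * A) ** N
```

The per-arm estimate uses `N * S * A * n` queries in total, which grows linearly in `N`, whereas the joint table that a tabular reduction would have to populate has `(S * A) ** N` entries.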