Average-reward reinforcement learning offers a principled framework for long-term decision-making by maximizing the mean reward per time step. Although Q-learning is a widely used model-free algorithm with well-established sample complexity guarantees in discounted and finite-horizon Markov decision processes (MDPs), its theoretical guarantees in the average-reward setting remain limited. This work studies a simple yet effective Q-learning algorithm for average-reward MDPs with finite state and action spaces, under the assumption that the MDP is weakly communicating, covering both single-agent and federated scenarios. For the single-agent case, we show that Q-learning with carefully chosen parameters learns an $\varepsilon$-optimal policy with sample complexity $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{\varepsilon^3}\right)$, where $\|h^{\star}\|_{\mathsf{sp}}$ is the span norm of the bias function, improving upon previous results by a factor of at least $\frac{\|h^{\star}\|_{\mathsf{sp}}^2}{\varepsilon^2}$. In the federated setting with $M$ agents, we prove that collaboration reduces the per-agent sample complexity to $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|_{\mathsf{sp}}^3}{M\varepsilon^3}\right)$, while requiring only $\widetilde{O}\left(\frac{\|h^{\star}\|_{\mathsf{sp}}}{\varepsilon}\right)$ communication rounds. These results establish the first federated Q-learning algorithm for average-reward MDPs with provable efficiency in both sample and communication complexity.
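To make the single-agent setting concrete, the following is a minimal sketch of RVI-style (relative value iteration) Q-learning, one standard form a tabular Q-learning update for average-reward MDPs can take; the abstract does not specify the paper's exact update or parameters. The synthetic MDP, the generative-model sampling of $(s,a)$ pairs, the $1/\sqrt{t}$ step size, and the reference pair anchoring the gain are all illustrative assumptions.

```python
import numpy as np

# Illustrative RVI-style Q-learning on a random synthetic average-reward MDP.
# Assumptions (not from the paper): generative-model sampling of (s, a),
# a 1/sqrt(t) step size, and a fixed reference pair anchoring the gain.

rng = np.random.default_rng(0)
S, A = 5, 3                                   # |S|, |A|
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] is a distribution over S
R = rng.uniform(0.0, 1.0, size=(S, A))        # rewards in [0, 1]

Q = np.zeros((S, A))
s_ref, a_ref = 0, 0                           # reference pair; Q[s_ref, a_ref]
                                              # tracks the optimal gain estimate
for t in range(1, 200_001):
    s, a = rng.integers(S), rng.integers(A)   # generative-model sample
    s_next = rng.choice(S, p=P[s, a])
    alpha = 1.0 / np.sqrt(t)                  # illustrative step size
    # Relative TD error: subtracting Q[s_ref, a_ref] keeps Q bounded, since
    # un-anchored average-reward values would drift linearly in t.
    td = R[s, a] + Q[s_next].max() - Q[s_ref, a_ref] - Q[s, a]
    Q[s, a] += alpha * td

print("estimated optimal gain:", Q[s_ref, a_ref])
print("greedy policy:", Q.argmax(axis=1))
```

The subtraction of the reference entry is what distinguishes the average-reward update from its discounted counterpart: there is no discount factor to keep the values bounded, so the gain is absorbed into the reference term instead.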
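The federated result can likewise be pictured as $M$ agents running local updates between periodic synchronizations, with the number of averaging steps playing the role of the communication rounds. The sketch below is a hedged illustration only: the synchronization rule (plain averaging of Q-tables), the local step count `K`, and the step sizes are placeholders, not the paper's algorithm.

```python
import numpy as np

# Illustrative federated averaging of RVI-style Q-tables across M agents.
# Assumptions (not from the paper): each agent samples independently from
# a shared generative model, runs K local updates per round, and a server
# averages the M local Q-tables once per communication round.

rng = np.random.default_rng(1)
S, A, M = 5, 3, 4
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0.0, 1.0, size=(S, A))
s_ref, a_ref = 0, 0

def local_updates(Q, K, t0):
    """K local RVI-style updates starting from the broadcast table Q."""
    Q = Q.copy()
    for k in range(K):
        s, a = rng.integers(S), rng.integers(A)
        s_next = rng.choice(S, p=P[s, a])
        alpha = 1.0 / np.sqrt(t0 + k + 1)     # placeholder step size
        Q[s, a] += alpha * (R[s, a] + Q[s_next].max()
                            - Q[s_ref, a_ref] - Q[s, a])
    return Q

Q_global = np.zeros((S, A))
rounds, K = 50, 1_000                         # communication rounds x local steps
for r in range(rounds):
    # Broadcast, update locally (in parallel in practice; sequential here),
    # then average the M tables at the server.
    local_tables = [local_updates(Q_global, K, r * K) for _ in range(M)]
    Q_global = np.mean(local_tables, axis=0)

print("estimated optimal gain:", Q_global[s_ref, a_ref])
print("greedy policy:", Q_global.argmax(axis=1))
```

In this picture, the $M$-fold reduction in per-agent sample complexity comes from each agent contributing independent samples between synchronizations, while the round count stays small because tables are exchanged only once per round rather than per update.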