We study cooperative multi-agent reinforcement learning in the setting of reward-free exploration, where multiple agents jointly explore an unknown MDP in order to learn its dynamics (without observing rewards). We focus on a tabular finite-horizon MDP and adopt a phased learning framework. In each learning phase, the agents interact with the environment independently: each agent is assigned a policy, executes it, and observes the resulting trajectory. Our primary goal is to characterize the tradeoff between the number of learning phases and the number of agents, especially when the number of learning phases is small. Our results identify a sharp transition governed by the horizon $H$. When the number of learning phases equals $H$, we present a computationally efficient algorithm that uses only $\tilde{O}(S^6 H^6 A / ε^2)$ agents to obtain an $ε$-approximation of the dynamics (i.e., one that yields an $ε$-optimal policy for any reward function). We complement our algorithm with a lower bound showing that any algorithm restricted to $ρ < H$ phases requires at least $A^{H/ρ}$ agents to achieve constant accuracy. Thus, on the order of $H$ learning phases are essential whenever the number of agents is limited to be polynomial in the problem parameters.
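The phased interaction protocol described above can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: it assumes a uniform-random exploration policy for every agent, a fixed initial state, and illustrative problem sizes (`S`, `A`, `H`, `n_agents`), whereas the efficient algorithm assigns carefully chosen policies in each phase.

```python
import numpy as np

# Hedged sketch of phased, reward-free exploration in a tabular
# finite-horizon MDP. All concrete choices below (uniform-random
# policies, fixed initial state, problem sizes) are illustrative
# assumptions, not the algorithm from the abstract.

rng = np.random.default_rng(0)
S, A, H = 4, 3, 5          # states, actions, horizon
n_agents = 2000            # agents deployed per learning phase

# Unknown ground-truth dynamics: P_true[h, s, a] is a distribution
# over next states at horizon step h.
P_true = rng.dirichlet(np.ones(S), size=(H, S, A))

counts = np.zeros((H, S, A, S))  # visit counts for (h, s, a, s')

for phase in range(H):  # one learning phase per horizon step
    # Each agent executes its assigned policy (here: uniform-random
    # actions) and reports its observed trajectory; no rewards are seen.
    for _ in range(n_agents):
        s = 0  # fixed initial state (an assumption of this sketch)
        for h in range(H):
            a = rng.integers(A)
            s_next = rng.choice(S, p=P_true[h, s, a])
            counts[h, s, a, s_next] += 1
            s = s_next

# Empirical estimate of the dynamics from the pooled trajectories;
# unvisited (h, s, a) triples fall back to the uniform distribution.
totals = counts.sum(axis=-1, keepdims=True)
P_hat = np.divide(counts, totals, out=np.full_like(counts, 1.0 / S),
                  where=totals > 0)
```

With enough agents per phase, `P_hat` concentrates around `P_true` on the reachable part of the state space, which is the sense in which the estimated dynamics yield a near-optimal policy for any reward function.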