Provable Cooperative Multi-Agent Exploration for Reward-Free MDPs

We study cooperative multi-agent reinforcement learning in the setting of reward-free exploration, where multiple agents jointly explore an unknown MDP in order to learn its dynamics without observing rewards. We focus on a tabular finite-horizon MDP and adopt a phased learning framework. In each learning phase, multiple agents interact with the environment independently: each agent is assigned a policy, executes it, and observes the resulting trajectory. Our primary goal is to characterize the tradeoff between the number of learning phases and the number of agents, especially when the number of learning phases is small. Our results identify a sharp transition governed by the horizon $H$. When the number of learning phases equals $H$, we present a computationally efficient algorithm that uses only $\tilde{O}(S^6 H^6 A / \epsilon^2)$ agents to obtain an $\epsilon$-approximation of the dynamics (i.e., one that yields an $\epsilon$-optimal policy for any reward function). We complement our algorithm with a lower bound showing that any algorithm restricted to $\rho < H$ phases requires at least $A^{H/\rho}$ agents to achieve constant accuracy. Thus, on the order of $H$ learning phases are essential if the number of agents is to remain polynomial.
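The phased interaction protocol described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the policy assigned to each agent here is a uniformly random placeholder, whereas the paper's method would choose policies based on data from earlier phases. The function names (`make_random_mdp`, `run_trajectory`, `phased_exploration`, `estimate_transitions`) and the randomly generated MDP are hypothetical, introduced only to show the structure of phases, independent agents, and pooled trajectories.

```python
import random
from collections import defaultdict


def make_random_mdp(S, A, H, rng):
    """Hypothetical tabular MDP: P[h][s][a] is a list of next-state probabilities."""
    P = []
    for _ in range(H):
        layer = []
        for _ in range(S):
            row = []
            for _ in range(A):
                weights = [rng.random() for _ in range(S)]
                total = sum(weights)
                row.append([w / total for w in weights])
            layer.append(row)
        P.append(layer)
    return P


def run_trajectory(P, policy, H, rng, s0=0):
    """One agent executes its policy for H steps; no rewards are observed."""
    trajectory, s = [], s0
    for h in range(H):
        a = policy(h, s)
        probs = P[h][s][a]
        s_next = rng.choices(range(len(probs)), weights=probs)[0]
        trajectory.append((h, s, a, s_next))
        s = s_next
    return trajectory


def phased_exploration(P, A, H, num_phases, agents_per_phase, rng):
    """Run the phases in sequence; agents within a phase act independently."""
    counts = defaultdict(int)
    for _ in range(num_phases):
        # Placeholder assignment (an assumption of this sketch): every agent
        # receives a uniformly random policy instead of a data-driven one.
        policies = [lambda h, s: rng.randrange(A) for _ in range(agents_per_phase)]
        for policy in policies:
            for transition in run_trajectory(P, policy, H, rng):
                counts[transition] += 1
    return counts


def estimate_transitions(counts, S, A, H):
    """Empirical transition estimates for every visited (h, s, a) triple."""
    P_hat = {}
    for h in range(H):
        for s in range(S):
            for a in range(A):
                row = [counts.get((h, s, a, s2), 0) for s2 in range(S)]
                n = sum(row)
                if n > 0:
                    P_hat[(h, s, a)] = [c / n for c in row]
    return P_hat
```

Calling `phased_exploration` with `num_phases` equal to `H` mirrors the regime in which the abstract's upper bound applies, although the random policies in this sketch carry none of its guarantees.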


Related content

Markov decision process


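A Markov decision process (MDP) provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying a range of optimization problems solved via dynamic programming and reinforcement learning. MDPs have been known since at least the 1950s. A core body of research on MDPs stems from Ronald A. Howard's 1960 book, Dynamic Programming and Markov Processes. They are used in a wide range of disciplines, including robotics, automatic control, economics, and manufacturing.

More precisely, an MDP is a discrete-time stochastic control process. At each time step, the process is in some state $s$, and the decision maker may choose any action $a$ that is available in that state. At the next time step, the process responds by moving randomly into a new state $s'$ and giving the decision maker a corresponding reward. The probability that the process moves into the new state is influenced by the chosen action; specifically, it is given by the state transition function. The next state $s'$ therefore depends on the current state $s$ and the decision maker's action $a$. Given $s$ and $a$, however, it is conditionally independent of all previous states and actions. In other words, the state transitions of an MDP satisfy the Markov property.

MDPs are an extension of Markov chains. The difference is the addition of actions (which allow choice) and rewards (which give motivation). Conversely, if only one action exists for each state (e.g., "wait") and all rewards are the same (e.g., zero), an MDP reduces to a Markov chain.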
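The relationship between MDPs and Markov chains described above can be sketched in a few lines of Python. The two-state MDP, its action names, and the helpers `step` and `induced_markov_chain` are hypothetical, chosen only for illustration. The sketch shows a single MDP transition and how fixing one action per state leaves an ordinary Markov chain.

```python
import random

# Hypothetical two-state MDP: P[s][a] maps each next state to its probability,
# and R[s][a] is the reward for taking action a in state s.
P = {
    0: {"stay": {0: 0.9, 1: 0.1}, "move": {0: 0.2, 1: 0.8}},
    1: {"stay": {0: 0.1, 1: 0.9}, "move": {0: 0.7, 1: 0.3}},
}
R = {
    0: {"stay": 0.0, "move": 1.0},
    1: {"stay": 0.5, "move": 0.0},
}


def step(P, R, s, a, rng):
    """Sample the next state from P[s][a] and return it with the reward."""
    dist = P[s][a]
    s_next = rng.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, R[s][a]


def induced_markov_chain(P, policy):
    """Fixing one action per state removes the choice, leaving a Markov chain."""
    return {s: dict(P[s][policy[s]]) for s in P}


# Degenerate case from the text: a single "wait" action in every state.
P_wait = {
    0: {"wait": {0: 0.5, 1: 0.5}},
    1: {"wait": {0: 0.3, 1: 0.7}},
}
chain = induced_markov_chain(P_wait, {0: "wait", 1: "wait"})
```

In the degenerate case, `chain` is simply the transition table of a two-state Markov chain. With all rewards equal, nothing distinguishes one trajectory from another, so no decision remains.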
