Although parallelism has been extensively used in reinforcement learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature, which focuses on approaches that encourage agents to explore a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterparts. Furthermore, we demonstrate that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs. From a practical perspective, our paper shows that a single policy is sufficient and provably near-optimal for incorporating parallelism during the exploration phase.