We address the challenge of exploration in reinforcement learning (RL) when the agent operates in an unknown environment with sparse or no rewards. In this work, we study two different types of the maximum entropy exploration problem. The first type is visitation entropy maximization, previously considered by Hazan et al. (2019) in the discounted setting. For this type of exploration, we propose a game-theoretic algorithm with $\widetilde{\mathcal{O}}(H^3S^2A/\varepsilon^2)$ sample complexity, thus improving the $\varepsilon$-dependence of existing results, where $S$ is the number of states, $A$ is the number of actions, $H$ is the episode length, and $\varepsilon$ is the desired accuracy. The second type of entropy we study is the trajectory entropy. This objective function is closely related to entropy-regularized MDPs, and we propose a simple algorithm with a sample complexity of order $\widetilde{\mathcal{O}}(\mathrm{poly}(S,A,H)/\varepsilon)$. Interestingly, this is the first theoretical result in the RL literature that establishes the potential statistical advantage of regularized MDPs for exploration. Finally, we apply the developed regularization techniques to reduce the sample complexity of visitation entropy maximization to $\widetilde{\mathcal{O}}(H^2SA/\varepsilon^2)$, yielding a statistical separation between maximum entropy exploration and reward-free exploration.
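To make the two objectives concrete, the following is a minimal sketch that evaluates both entropies for a fixed policy in a small tabular episodic MDP. It is an illustration under stated assumptions, not the algorithms proposed in this work: we assume the visitation entropy is the sum over steps $h$ of the Shannon entropy of the state-action visitation distribution $d_h(s,a)$, and that the trajectory entropy is the Shannon entropy of the distribution over trajectories $(s_1,a_1,\ldots,s_H,a_H)$, computed here with the chain rule. The function names, the array layout, and the time-homogeneous transition kernel `P` are hypothetical choices made for brevity.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in nats of a probability array, with 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())


def visitation_entropy(P, pi, mu0):
    """Sum over steps h of the entropy of d_h(s, a) (assumed definition).

    P   : (S, A, S) transition kernel, assumed time-homogeneous
    pi  : (H, S, A) step-dependent Markov policy, pi[h, s] is a distribution over actions
    mu0 : (S,) initial state distribution
    """
    state_dist = np.asarray(mu0, dtype=float)
    total = 0.0
    for h in range(pi.shape[0]):
        d_h = state_dist[:, None] * pi[h]             # d_h(s, a) = d_h(s) * pi_h(a | s)
        total += entropy(d_h)
        state_dist = np.einsum("sa,sat->t", d_h, P)   # next-step state distribution
    return total


def trajectory_entropy(P, pi, mu0):
    """Entropy of the trajectory distribution via the chain rule (assumed definition).

    H(s_1) + sum_h H(a_h | s_h) + sum_{h < H} H(s_{h+1} | s_h, a_h)
    """
    H, S, A = pi.shape
    state_dist = np.asarray(mu0, dtype=float)
    total = entropy(state_dist)
    for h in range(H):
        d_h = state_dist[:, None] * pi[h]
        # Conditional entropy of the action given the state at step h.
        total += sum(state_dist[s] * entropy(pi[h, s]) for s in range(S))
        if h < H - 1:
            # Conditional entropy of the next state given (s_h, a_h).
            total += sum(d_h[s, a] * entropy(P[s, a])
                         for s in range(S) for a in range(A))
        state_dist = np.einsum("sa,sat->t", d_h, P)
    return total


# Toy example: 2 states, 2 actions, horizon 2, deterministic transitions.
# Action 0 stays in the current state, action 1 moves to the other state.
P_demo = np.zeros((2, 2, 2))
P_demo[0, 0, 0] = P_demo[1, 0, 1] = 1.0
P_demo[0, 1, 1] = P_demo[1, 1, 0] = 1.0
pi_demo = np.full((2, 2, 2), 0.5)    # uniform policy at every step and state
mu0_demo = np.array([1.0, 0.0])      # always start in state 0

ve_demo = visitation_entropy(P_demo, pi_demo, mu0_demo)
te_demo = trajectory_entropy(P_demo, pi_demo, mu0_demo)
```

In this toy example the two quantities differ: the uniform policy yields a visitation entropy of $3\log 2$ and a trajectory entropy of $2\log 2$, because with deterministic transitions the trajectory entropy only accumulates the randomness of the actions, whereas the visitation entropy also rewards spreading mass over states at each step.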