We consider the reinforcement learning (RL) setting in which the agent has to act in an unknown environment driven by a Markov Decision Process (MDP) with sparse or even reward-free signals. In this situation, exploration becomes the main challenge. In this work, we study two types of maximum entropy exploration problems. The first type is visitation entropy maximization, which was previously considered by Hazan et al. (2019) in the discounted setting. For this type of exploration, we propose an algorithm based on a game-theoretic representation that has $\widetilde{\mathcal{O}}(H^3 S^2 A / \varepsilon^2)$ sample complexity, thus improving the $\varepsilon$-dependence of Hazan et al. (2019), where $S$ is the number of states, $A$ is the number of actions, $H$ is the episode length, and $\varepsilon$ is the desired accuracy. The second type of entropy we study is the trajectory entropy. This objective function is closely related to entropy-regularized MDPs, and we propose a simple modification of the UCBVI algorithm that has a sample complexity of order $\widetilde{\mathcal{O}}(1/\varepsilon)$, ignoring the dependence on $S, A, H$. Interestingly, this is the first theoretical result in the RL literature establishing that the exploration problem for regularized MDPs can be statistically strictly easier (in terms of sample complexity) than for ordinary MDPs.
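To make the two objectives concrete, the following minimal sketch evaluates both entropies for a fixed policy in a small tabular, finite-horizon MDP. It is an illustration only, not the algorithms proposed in this work. The definitions used are assumptions, since the abstract does not state them: visitation entropy is taken to be the sum over steps $h$ of the Shannon entropy of the state-action visitation distribution $d_h$, and trajectory entropy is taken to be the Shannon entropy of the distribution over trajectories $(s_1, a_1, \ldots, s_H, a_H)$, computed via the chain rule. Transitions are assumed time-homogeneous for brevity, and all function names and the toy MDP are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log); terms with zero probability contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def visitation_distributions(P, pi, mu0, H):
    """Return d[h][s][a]: probability of visiting (s, a) at step h.

    P[s][a][s2] : transition probabilities (assumed time-homogeneous)
    pi[h][s][a] : policy probabilities at step h
    mu0[s]      : initial state distribution
    """
    S, A = len(mu0), len(pi[0][0])
    state_dist = list(mu0)
    d = []
    for h in range(H):
        d_h = [[state_dist[s] * pi[h][s][a] for a in range(A)] for s in range(S)]
        d.append(d_h)
        # Push the state-action distribution through the transition kernel.
        next_dist = [0.0] * S
        for s in range(S):
            for a in range(A):
                for s2 in range(S):
                    next_dist[s2] += d_h[s][a] * P[s][a][s2]
        state_dist = next_dist
    return d


def visitation_entropy(d):
    """Assumed definition: sum over steps of the entropy of d_h over (s, a)."""
    return sum(entropy([p for row in d_h for p in row]) for d_h in d)


def trajectory_entropy(P, pi, mu0, d):
    """Entropy of (s_1, a_1, ..., s_H, a_H) via the chain rule (assumed definition)."""
    H = len(d)
    total = entropy(mu0)  # entropy of the initial state
    for h, d_h in enumerate(d):
        for s, row in enumerate(d_h):
            # Expected policy entropy at step h; sum(row) is P(s_h = s).
            total += sum(row) * entropy(pi[h][s])
            if h < H - 1:
                # Expected transition entropy; the last action has no successor state.
                for a, weight in enumerate(row):
                    total += weight * entropy(P[s][a])
    return total


# Toy MDP (hypothetical): 2 states, 2 actions, action a moves deterministically to state a.
P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
H = 2
pi = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(H)]  # uniform policy at every step
mu0 = [1.0, 0.0]  # always start in state 0

d = visitation_distributions(P, pi, mu0, H)
ve = visitation_entropy(d)
te = trajectory_entropy(P, pi, mu0, d)
```

In this toy example the uniform policy gives a visitation entropy of $3\log 2$ (entropy $\log 2$ at the first step and $\log 4$ at the second) and a trajectory entropy of $2\log 2$ (four equally likely trajectories). The two objectives differ even here, because the visitation entropy scores each step's marginal distribution separately while the trajectory entropy scores the joint distribution over whole trajectories.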