Reinforcement learning (RL) in changing environments models many real-world applications via nonstationary Markov decision processes (MDPs) and has therefore attracted considerable interest. However, theoretical studies of nonstationary MDPs have mainly focused on tabular and linear (mixture) MDPs, which do not capture the unknown representations that characterize deep RL. In this paper, we make the first effort to investigate nonstationary RL under episodic low-rank MDPs, where both the transition kernels and the rewards may vary over time, and the low-rank model contains an unknown representation in addition to the linear state embedding function. We first propose a parameter-dependent policy optimization algorithm called PORTAL, and then extend it to a parameter-free version, Ada-PORTAL, which tunes its hyperparameters adaptively without any prior knowledge of the nonstationarity. For both algorithms, we provide upper bounds on the average dynamic suboptimality gap. These bounds show that, as long as the nonstationarity is not significantly large, PORTAL and Ada-PORTAL are sample-efficient and can achieve an arbitrarily small average dynamic suboptimality gap with polynomial sample complexity.
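As a rough illustration of the two objects named in the abstract, the low-rank structure and the average dynamic suboptimality gap can be sketched as below. The notation (round index k, number of rounds K, step h, dimension d, feature maps \phi and \mu, value function V) is assumed here for illustration and does not appear in the text above; the paper's own definitions may differ in detail.

```latex
% Assumed notation: low-rank transition kernel at round k, step h.
% Both the representation \phi and the state embedding \mu are unknown
% to the learner and may change with k.
P_h^{\star,k}(s' \mid s, a)
  = \big\langle \phi_h^{\star,k}(s,a),\ \mu_h^{\star,k}(s') \big\rangle,
\qquad
\phi_h^{\star,k}(s,a),\ \mu_h^{\star,k}(s') \in \mathbb{R}^{d}.

% Assumed notation: average dynamic suboptimality gap over K rounds.
% The comparator \pi^{\star,k} is the best policy for round k's own
% model (P^{\star,k}, r^k), so it may change from round to round.
\mathrm{Gap}_{\mathrm{Ave}}(K)
  = \frac{1}{K} \sum_{k=1}^{K}
    \Big[ V^{\pi^{\star,k}}_{P^{\star,k},\, r^{k}}
        - V^{\pi^{k}}_{P^{\star,k},\, r^{k}} \Big],
\qquad
\pi^{\star,k} \in \arg\max_{\pi} V^{\pi}_{P^{\star,k},\, r^{k}}.
```

Under this reading, "dynamic" refers to the comparator policy changing with each round, and sample efficiency means that the gap can be driven below any target accuracy using a number of samples polynomial in the problem parameters.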