We consider reinforcement learning (RL) in episodic Markov decision processes (MDPs) with linear function approximation in a drifting environment. Specifically, both the reward and state transition functions can evolve over time, provided that their total variations do not exceed a $\textit{variation budget}$. We first develop the $\texttt{LSVI-UCB-Restart}$ algorithm, an optimistic modification of least-squares value iteration with periodic restarts, and bound its dynamic regret when the variation budgets are known. We then propose a parameter-free algorithm, $\texttt{Ada-LSVI-UCB-Restart}$, that extends to unknown variation budgets. We also derive the first minimax dynamic regret lower bound for nonstationary linear MDPs and, as a byproduct, establish a minimax regret lower bound for linear MDPs, a problem left unsolved by Jin et al. (2020). Finally, we provide numerical experiments to demonstrate the effectiveness of our proposed algorithms.
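To make the idea of optimistic least squares with periodic restart concrete, the following is a minimal toy sketch, not the $\texttt{LSVI-UCB-Restart}$ algorithm itself. It assumes a single-step linear reward model rather than a full episodic MDP, and the function name `restart_ridge_ucb` and the parameters `window`, `lam`, and `beta` are hypothetical choices made for illustration only.

```python
import numpy as np

def restart_ridge_ucb(features, rewards, window, lam=1.0, beta=1.0):
    """Toy illustration: ridge regression with periodic restart and a UCB-style bonus.

    features: (T, d) array of feature vectors phi_t (hypothetical data).
    rewards:  (T,) array of observed rewards.
    window:   restart period; all statistics are reset every `window` steps.
    Returns, for each t, the optimistic estimate
        phi_t^T w_t + beta * sqrt(phi_t^T Lambda_t^{-1} phi_t),
    computed only from samples seen since the most recent restart.
    """
    T, d = features.shape
    estimates = np.empty(T)
    for t in range(T):
        if t % window == 0:
            # Restart: discard past data so that samples from an older
            # version of the environment cannot bias the estimate.
            gram = lam * np.eye(d)   # regularized Gram matrix Lambda
            target = np.zeros(d)     # running sum of phi * reward
        phi = features[t]
        w = np.linalg.solve(gram, target)  # ridge regression solution
        bonus = beta * np.sqrt(phi @ np.linalg.solve(gram, phi))
        estimates[t] = phi @ w + bonus
        # Incorporate the newly observed sample.
        gram += np.outer(phi, phi)
        target += phi * rewards[t]
    return estimates
```

In this sketch `window` is a free parameter: a shorter period forgets stale data sooner, at the cost of fewer samples per estimate. Choosing it from a known variation budget, or adaptively when the budget is unknown, is the question the two algorithms above address.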