General function approximation is a powerful tool for handling large state and action spaces in a broad range of reinforcement learning (RL) scenarios. However, theoretical understanding of non-stationary MDPs with general function approximation is still limited. In this paper, we make the first such attempt. We first propose a new complexity metric for non-stationary MDPs, the dynamic Bellman Eluder (DBE) dimension, which subsumes the majority of existing tractable RL problems in both static and non-stationary MDPs. Based on this complexity metric, we propose a novel confidence-set-based model-free algorithm called SW-OPEA, which features a sliding window mechanism and a new confidence set design for non-stationary MDPs. We then establish an upper bound on the dynamic regret of the proposed algorithm and show that SW-OPEA is provably efficient as long as the variation budget is not too large. We further demonstrate, via examples of non-stationary linear and tabular MDPs, that our algorithm outperforms existing UCB-type algorithms when the variation budget is small. To the best of our knowledge, this is the first dynamic regret analysis of non-stationary MDPs with general function approximation.
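For readers unfamiliar with the performance measure, dynamic regret in episodic non-stationary MDPs is commonly defined as the cumulative gap between the per-episode optimal value and the value of the policy actually played. The sketch below uses that common convention; the symbols $K$ (number of episodes), $\pi^k$ (policy in episode $k$), $s_1^k$ (initial state), and $V_{1}^{\pi,k}$ (value under the episode-$k$ dynamics and rewards) are notation assumed here for illustration and may differ from the paper's own.

```latex
\[
  \mathrm{D\text{-}Regret}(K)
  \;=\;
  \sum_{k=1}^{K}
  \Bigl(
    V_{1}^{\pi^{*,k},\,k}\bigl(s_1^k\bigr)
    \;-\;
    V_{1}^{\pi^{k},\,k}\bigl(s_1^k\bigr)
  \Bigr),
  \qquad
  \pi^{*,k} \in \arg\max_{\pi} V_{1}^{\pi,\,k}\bigl(s_1^k\bigr).
\]
```

Unlike static regret, the comparator $\pi^{*,k}$ may change from episode to episode, which is why bounds of this kind typically depend on a variation budget that measures how much the environment drifts.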