Low-rank Markov decision processes (MDPs) have recently emerged as a promising framework in reinforcement learning (RL), as they allow for probably approximately correct (PAC) learning guarantees while also incorporating machine learning algorithms for representation learning. However, current methods for low-rank MDPs consider only finite action spaces and give vacuous bounds as $|\mathcal{A}| \to \infty$, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modifications to the algorithm, we obtain a similar PAC bound when actions are allowed to be continuous. Specifically, when the model for transition functions satisfies a Hölder smoothness condition with respect to actions, and either the policy class has a uniformly bounded minimum density or the reward function is also Hölder smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.
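For concreteness, the two structural assumptions named in the abstract can be sketched as follows. This uses the standard low-rank MDP factorization and the simplest form of a Hölder condition; it is not the paper's exact statement. The symbols $\phi$, $\mu$, $d$, $L$, $\alpha$, the restriction to $\alpha \in (0, 1]$, and the choice of norms are assumptions introduced here for illustration only.

$$
T(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle, \qquad \phi(s, a),\ \mu(s') \in \mathbb{R}^{d},
$$

$$
\bigl\| T(\cdot \mid s, a) - T(\cdot \mid s, a') \bigr\|_{1} \le L \, \| a - a' \|^{\alpha} \qquad \text{for all } s,\ a,\ a', \quad \alpha \in (0, 1].
$$

The first line says that every transition probability is an inner product of $d$-dimensional embeddings, so the transition operator has rank at most $d$. The second says that nearby actions induce nearby next-state distributions, which is one way a bound could avoid depending on $|\mathcal{A}|$; the condition used in the paper may be stated differently or cover higher orders of smoothness.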