Low-rank Markov Decision Processes (MDPs) have recently emerged as a promising framework in reinforcement learning (RL), as they allow for probably approximately correct (PAC) learning guarantees while also incorporating machine learning algorithms for representation learning. However, existing methods for low-rank MDPs only consider finite action spaces and give vacuous bounds as $|\mathcal{A}| \to \infty$, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modification to the algorithm, a similar PAC bound holds when actions are allowed to be continuous. Specifically, when the model for the transition functions satisfies a H\"older smoothness condition with respect to actions, and either the policy class has a uniformly bounded minimum density or the reward function is also H\"older smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.
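For reference, the simplest form of a H\"older condition with respect to actions can be written as follows. This is only a sketch: the generic function $f$, the exponent $\alpha \in (0, 1]$, the constant $L > 0$, and the choice of norm are illustrative assumptions rather than the paper's notation, and the condition actually used in the paper may be of higher order.

$$
|f(a) - f(a')| \le L \, \|a - a'\|^{\alpha} \quad \text{for all } a, a' \in \mathcal{A}.
$$

With $\alpha = 1$ this reduces to Lipschitz continuity in the action.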