We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage functions directly from data, without unrealistic structural assumptions on the dynamics. Leveraging the ellipticity of the diffusions, we establish a new class of Hilbert-space positive definiteness and boundedness properties for the Bellman operators. Based on these properties, we propose the Sobolev-prox fitted $q$-learning algorithm, which learns value and advantage functions by iteratively solving least-squares regression problems. We derive oracle inequalities for the estimation error, governed by (i) the best approximation error of the function classes, (ii) their localized complexity, (iii) exponentially decaying optimization error, and (iv) numerical discretization error. These results identify ellipticity as a key structural property that renders reinforcement learning with function approximation for Markov diffusions no harder than supervised learning.
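To make the iterative least-squares structure concrete, the following is a minimal illustrative sketch of generic fitted $q$-learning with linear function approximation on synthetic data from a discretized one-dimensional diffusion. Everything here is a hypothetical stand-in: the feature map, the cost, the action grid, and the dynamics are assumptions for illustration, and the sketch omits the Sobolev-norm proximal regularization and advantage-function estimation that distinguish the Sobolev-prox algorithm proposed in the paper.

```python
# Illustrative sketch only: generic fitted q-iteration via least squares,
# NOT the paper's Sobolev-prox algorithm. All names and modeling choices
# below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic offline dataset of one-step transitions (s, a, r, s') from an
# Euler discretization of a one-dimensional controlled diffusion.
n, dt, beta = 2000, 0.01, 1.0
gamma = np.exp(-beta * dt)                    # discount over one time step
S = rng.uniform(-1.0, 1.0, size=n)            # observed states
A = rng.choice([-1.0, 0.0, 1.0], size=n)      # actions on a finite grid
R = -(S**2) * dt                              # assumed running reward
S_next = S + A * dt + np.sqrt(dt) * rng.normal(size=n)  # Euler step
actions = np.array([-1.0, 0.0, 1.0])

def phi(s, a):
    """Hypothetical polynomial feature map for (state, action) pairs."""
    return np.stack([np.ones_like(s), s, s**2, a, a * s], axis=-1)

theta = np.zeros(phi(S, A).shape[1])

# Fitted q-iteration: each step regresses the one-step Bellman targets
# onto the linear feature class by ordinary least squares.
for k in range(50):
    # Greedy bootstrap target: maximize over the action grid at s'.
    q_next = np.max(
        np.stack([phi(S_next, np.full(n, a)) @ theta for a in actions], axis=1),
        axis=1,
    )
    y = R + gamma * q_next                            # regression targets
    X = phi(S, A)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares fit

print("fitted parameters:", theta)
```

Each iteration is a supervised regression problem, which is the sense in which the oracle inequalities can decompose the estimation error into approximation, complexity, optimization, and discretization terms.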