We study the operator-theoretic core of Q-learning in continuous-time stochastic control with continuous states and actions. In value-based reinforcement learning, each Q-learning or DQN update is built from a Bellman optimality target; our analysis isolates this target in a diffusion setting and studies its regularity and approximation complexity. Under uniform ellipticity and Hölder-regular coefficients, we show that a Bellman update maps bounded inputs into an anisotropic regularity class: it smooths the state variable but yields only Lipschitz dependence on the action variable. This yields a compact family of Bellman iterates and motivates a tensor-product DeepONet architecture adapted to the mixed regularity of the problem. We then derive explicit approximation and resource bounds, together with a stiffness--complexity trade-off as the time step $\delta \to 0$. The resulting theory contributes to Q-learning theory at the level of Bellman target regularity and approximation in continuous stochastic control. We do not claim a full convergence theorem for practical sampled Q-learning with exploration, replay, and stochastic gradient updates.
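For orientation, one generic form such a Bellman target could take is sketched here. This is an illustrative assumption, not necessarily the paper's exact definition: the symbols $b$, $\sigma$, $r$, $\beta$, $A$, and $T_\delta$ are hypothetical notation, and the action is assumed to be held constant over one time step of length $\delta$.

\[
dX_s^{x,a} = b(X_s^{x,a}, a)\,ds + \sigma(X_s^{x,a}, a)\,dW_s, \qquad X_0^{x,a} = x,
\]
\[
(T_\delta Q)(x,a) = \mathbb{E}\left[ \int_0^\delta e^{-\beta s}\, r(X_s^{x,a}, a)\,ds + e^{-\beta\delta} \sup_{a' \in A} Q(X_\delta^{x,a}, a') \right].
\]

In a form like this, the state $x$ enters only as the starting point of the diffusion, which is where uniform ellipticity can supply smoothing, whereas the action $a$ enters through the coefficients and the reward, so its regularity is limited by theirs.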