We devise a control-theoretic reinforcement learning approach that supports direct learning of the optimal policy. We establish various theoretical properties of our approach, including convergence and optimality of our analogs of the Bellman operator and Q-learning, a new control-policy-variable gradient theorem, and a specific gradient ascent algorithm based on this theorem within our control-theoretic framework. We empirically evaluate our control-theoretic approach on several classical reinforcement learning tasks, demonstrating significant improvements over state-of-the-art methods in solution quality, sample complexity, and running time.