We devise a control-theoretic reinforcement learning approach to support direct learning of the optimal policy. We establish theoretical properties of this approach, including convergence and optimality of our analogs of the Bellman operator and Q-learning, a new control-policy-variable gradient theorem, and a gradient ascent algorithm based on this theorem within a specific control-theoretic framework. We empirically evaluate our control-theoretic approach on several classical reinforcement learning tasks, demonstrating significant improvements in solution quality, sample complexity, and running time over state-of-the-art methods.