In this paper, we consider reinforcement learning for nonlinear systems with continuous state and action spaces. We present an episodic learning algorithm in which, at each episode, convex optimization is used to find a two-layer neural network approximation of the optimal $Q$-function. The convex optimization approach guarantees that the weights computed at each episode are optimal with respect to the sampled states and actions of that episode. For stable nonlinear systems, we show that the algorithm converges and that the parameters of the trained neural network can be made arbitrarily close to the optimal neural network parameters. Specifically, if the regularization parameter in the training phase is $\rho$, then the parameters of the trained neural network converge to $w$, where the distance between $w$ and the optimal parameters $w^\star$ is bounded by $\mathcal{O}(\rho)$. That is, as the number of episodes goes to infinity, there exists a constant $C$ such that \[ \|w-w^\star\| \le C\rho. \] Hence, the algorithm converges arbitrarily close to the optimal neural network parameters as the regularization parameter goes to zero. Moreover, the algorithm converges quickly, owing to the polynomial-time convergence of convex optimization algorithms.