In this paper, we consider reinforcement learning of nonlinear systems with continuous state and action spaces. We present an episodic learning algorithm in which, at each episode, convex optimization is used to find a two-layer neural network approximation of the optimal $Q$-function. The convex optimization approach guarantees that the weights computed at each episode are optimal with respect to the sampled states and actions of that episode. For stable nonlinear systems, we show that the algorithm converges and that the parameters of the trained neural network can be made arbitrarily close to the optimal neural network parameters. More precisely, if the regularization parameter is $\rho$ and the time horizon is $T$, then the parameters of the trained neural network converge to $w$, where the distance between $w$ and the optimal parameters $w^\star$ is bounded by $\mathcal{O}(\rho T^{-1})$. That is, as the number of episodes goes to infinity, there exists a constant $C$ such that \[\|w-w^\star\| \le C\cdot\frac{\rho}{T}.\] Consequently, the algorithm converges arbitrarily close to the optimal neural network parameters as the time horizon increases or as the regularization parameter decreases.
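The abstract does not spell out the convex program solved at each episode. As a rough illustration only, the sketch below fits the output layer of a two-layer ReLU network to one-step Bellman targets, which is a convex (regularized least-squares) problem once the first-layer weights are fixed. The names (`features`, `W1`), the random-feature first layer, the placeholder episode data, and the ridge objective are all assumptions for the sketch, not the paper's actual formulation.

```python
# Hedged sketch: one episode of convex Q-function fitting.
# Assumptions (not from the abstract): the first-layer weights are fixed
# random features, so fitting the output-layer weights w is convex;
# targets are one-step Bellman backups from sampled transitions.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)

n_x, n_u, n_hidden, T = 4, 2, 64, 200   # state dim, action dim, width, horizon
rho, gamma = 1e-2, 0.95                 # regularization parameter, discount

# Fixed (random) first layer: phi(x, u) = relu(W1 @ [x; u] + b1).
W1 = rng.standard_normal((n_hidden, n_x + n_u))
b1 = rng.standard_normal(n_hidden)

def features(x, u):
    return np.maximum(W1 @ np.concatenate([x, u]) + b1, 0.0)

# Placeholder episode data; a real implementation would roll out the system.
X = rng.standard_normal((T, n_x))
U = rng.standard_normal((T, n_u))
costs = rng.random(T)
X_next = rng.standard_normal((T, n_x))

Phi = np.stack([features(x, u) for x, u in zip(X, U)])

# One-step Bellman targets under the previous weights w_prev (zeros here);
# a real implementation would optimize over the action at the next state.
w_prev = np.zeros(n_hidden)
targets = costs + gamma * np.stack(
    [features(xn, np.zeros(n_u)) for xn in X_next]) @ w_prev

# Convex program: regularized least-squares fit of the output layer,
# so the per-episode weights are globally optimal for the sampled data.
w = cp.Variable(n_hidden)
objective = cp.Minimize(cp.sum_squares(Phi @ w - targets) / T
                        + rho * cp.sum_squares(w))
cp.Problem(objective).solve()
print("fitted output-layer weights (first 5):", w.value[:5])
```

Under this reading, the per-episode subproblem inherits the global-optimality guarantee of convex optimization, while the episodic outer loop drives the convergence of $w$ toward $w^\star$ at the stated $\mathcal{O}(\rho T^{-1})$ rate.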