Nonlinear control systems in which the decision maker has only partial information are prevalent in a variety of applications. As a step toward studying such nonlinear systems, this work explores reinforcement learning methods for finding the optimal policy in nearly linear-quadratic regulator systems. In particular, we consider a dynamic system that combines linear and nonlinear components and is governed by a policy with the same structure. Assuming that the nonlinear component comprises kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. Although the cost function is nonconvex in general, we establish local strong convexity and smoothness in the vicinity of the global optimizer. Additionally, we propose an initialization mechanism to leverage these properties. Building on these developments, we design a policy gradient algorithm that is guaranteed to converge to the globally optimal policy at a linear rate.
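To make the setting concrete, the following is a minimal sketch of gradient descent on the parameters of a policy with linear and nonlinear parts, applied to a toy scalar system. Everything specific here is an assumption made for illustration and is not taken from the paper: the scalar dynamics, the kernel `phi(x) = EPS * tanh(x)`, the constants, the finite horizon, the zero initialization, the finite-difference gradient estimate, and the step-halving safeguard that stands in for a tuned fixed step size.

```python
import math

# Hypothetical scalar system: x_{t+1} = A*x_t + B*u_t + C*phi(x_t).
# All constants below are illustrative assumptions, not values from the paper.
A, B, C = 0.8, 1.0, 1.0
EPS = 0.1                 # Lipschitz constant of the nonlinear kernel
Q, R = 1.0, 1.0           # quadratic cost weights on state and control
HORIZON = 50              # finite rollout length used to evaluate the cost
X0S = (-1.0, 0.5, 1.0)    # fixed initial states that the cost averages over


def phi(x):
    """Nonlinear kernel; tanh is 1-Lipschitz, so phi is EPS-Lipschitz."""
    return EPS * math.tanh(x)


def cost(theta):
    """Average finite-horizon quadratic cost of the policy u = -k*x - v*phi(x)."""
    k, v = theta
    total = 0.0
    for x in X0S:
        for _ in range(HORIZON):
            u = -k * x - v * phi(x)
            total += Q * x * x + R * u * u
            x = A * x + B * u + C * phi(x)
    return total / len(X0S)


def grad(theta, h=1e-5):
    """Central finite-difference estimate of the gradient of the cost."""
    g = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += h
        dn = list(theta)
        dn[i] -= h
        g.append((cost(up) - cost(dn)) / (2 * h))
    return g


def policy_gradient(theta, iters=200):
    """Gradient descent with step halving; returns final parameters and cost history."""
    history = [cost(theta)]
    for _ in range(iters):
        g = grad(theta)
        step = 1.0
        # Halve the step until the cost strictly decreases (a safeguard added here).
        while step > 1e-12:
            cand = [t - step * gi for t, gi in zip(theta, g)]
            c = cost(cand)
            if c < history[-1]:
                break
            step *= 0.5
        else:
            break  # no improving step was found; stop early
        theta = cand
        history.append(c)
    return theta, history


theta_opt, history = policy_gradient([0.0, 0.0])
```

Here `history` records the cost after each accepted step, so inspecting it shows how quickly the iterates improve on this toy problem. The sketch does not reproduce the initialization mechanism or the convergence guarantee described in the abstract.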