Nonlinear control systems in which the decision maker has only partial information are prevalent in a variety of applications. As a step toward studying such nonlinear systems, this work explores reinforcement learning methods for finding the optimal policy in nearly linear-quadratic regulator systems. In particular, we consider a dynamic system that combines linear and nonlinear components and is governed by a policy with the same structure. Assuming that the nonlinear component comprises kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. Although the cost function is nonconvex in general, we establish local strong convexity and smoothness in the vicinity of the global optimizer. Additionally, we propose an initialization mechanism to leverage these properties. Building on these developments, we design a policy gradient algorithm that is guaranteed to converge to the globally optimal policy at a linear rate.