Recent years have seen significant progress in understanding the optimization landscape of policy gradient methods for optimal control of linear time-invariant (LTI) systems. Output-feedback control is more prevalent in practice than state-feedback control, because the underlying state of the system is often not fully observed. This paper analyzes the optimization landscape of policy gradient methods applied to static output feedback (SOF) control of discrete-time LTI systems with quadratic cost. We first establish key properties of the SOF cost: coercivity, L-smoothness, and an M-Lipschitz continuous Hessian. Although the cost is not convex, these properties allow us to derive new results on convergence to stationary points, with a nearly dimension-free rate, for three policy gradient methods: the vanilla policy gradient method, the natural policy gradient method, and the Gauss-Newton method. Moreover, we prove that the vanilla policy gradient method converges linearly to a local minimum when initialized near it. Numerical examples validate the theoretical findings. These results not only characterize the performance of gradient descent on the SOF problem but also provide insight into the effectiveness of general policy gradient methods in reinforcement learning.
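To make the setting concrete, the following is a minimal sketch of vanilla policy gradient on an SOF cost. It is not the paper's implementation. The notation is an assumption: dynamics x_{t+1} = A x_t + B u_t, output y_t = C x_t, policy u_t = -K y_t, cost J(K) = tr(P_K X0) with X0 the covariance of the initial state, and the standard model-based gradient 2[(R + B'P_K B) K C - B'P_K A] Σ_K C'. The example system, the helper names (`dlyap`, `sof_cost_and_grad`, `vanilla_pg`), and the backtracking step-size rule are all illustrative choices.

```python
import numpy as np

def dlyap(F, W):
    """Solve X = W + F X F^T by Kronecker vectorization (small systems only)."""
    n = F.shape[0]
    lhs = np.eye(n * n) - np.kron(F, F)
    return np.linalg.solve(lhs, W.reshape(-1)).reshape(n, n)

def sof_cost_and_grad(K, A, B, C, Q, R, X0):
    """Return (J(K), grad J(K)) for u = -K y; (inf, None) if K is not stabilizing."""
    F = A - B @ K @ C  # closed-loop matrix
    if np.max(np.abs(np.linalg.eigvals(F))) >= 1.0:
        return np.inf, None
    # Value matrix: P = Q + C'K'RKC + F' P F
    P = dlyap(F.T, Q + C.T @ K.T @ R @ K @ C)
    # Accumulated state covariance: Sigma = X0 + F Sigma F'
    Sigma = dlyap(F, X0)
    cost = np.trace(P @ X0)
    E = (R + B.T @ P @ B) @ K @ C - B.T @ P @ A
    return cost, 2.0 * E @ Sigma @ C.T

def vanilla_pg(K0, A, B, C, Q, R, X0, iters=500, tol=1e-6, c=0.25):
    """Gradient descent on J(K) with Armijo backtracking (assumed step rule).

    K0 must be stabilizing. Backtracking rejects steps that leave the
    stabilizing set, because the cost there is reported as infinite.
    """
    K = K0.copy()
    cost, grad = sof_cost_and_grad(K, A, B, C, Q, R, X0)
    for _ in range(iters):
        g2 = float(np.sum(grad ** 2))
        if np.sqrt(g2) < tol:
            break
        step = 1.0
        while step > 1e-12:
            K_new = K - step * grad
            cost_new, grad_new = sof_cost_and_grad(K_new, A, B, C, Q, R, X0)
            if cost_new <= cost - c * step * g2:
                break
            step *= 0.5
        else:
            break  # no acceptable step found; stop
        K, cost, grad = K_new, cost_new, grad_new
    return K, cost, grad

# Illustrative two-state, single-input, single-output system (hypothetical data).
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
Q, R, X0 = np.eye(2), np.eye(1), np.eye(2)

K0 = np.zeros((1, 1))  # stabilizing because A itself is stable
J0, _ = sof_cost_and_grad(K0, A, B, C, Q, R, X0)
K_opt, J_opt, g_opt = vanilla_pg(K0, A, B, C, Q, R, X0)
print("initial cost:", J0, "final cost:", J_opt, "gain:", K_opt.ravel())
```

Reporting the cost as infinite outside the stabilizing set mirrors the role of coercivity: the cost grows without bound toward the boundary of that set, so a descent method that enforces sufficient decrease stays inside it. A fixed step size chosen from the smoothness constant L would be an alternative; backtracking is used here only because it needs no knowledge of L.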