We study the global linear convergence of policy gradient (PG) methods for finite-horizon continuous-time exploratory linear-quadratic control (LQC) problems. The setting includes stochastic LQC problems with indefinite costs and allows additional entropy regularisers in the objective. We consider a continuous-time Gaussian policy whose mean is linear in the state variable and whose covariance is state-independent. In contrast to discrete-time problems, the cost is noncoercive in the policy and not all descent directions lead to bounded iterates. We propose geometry-aware gradient descents for the mean and covariance of the policy using the Fisher geometry and the Bures-Wasserstein geometry, respectively. The policy iterates are shown to satisfy an a priori bound and to converge globally to the optimal policy at a linear rate. We further propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis and achieves robust linear convergence across different action frequencies. A numerical experiment confirms the convergence and robustness of the proposed algorithm.
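As a rough illustration of the policy class described in the abstract, the Gaussian policy can be written as follows. The symbols $K_t$, $V_t$ and $T$ are notation assumed here for the sketch and do not appear in the abstract; the time dependence of the gain and covariance is likewise an assumption suggested by the finite-horizon setting.

```latex
% Assumed notation (not from the source):
%   K_t : feedback gain, so the policy mean K_t x is linear in the state x
%   V_t : symmetric positive definite covariance, independent of the state x
%   T   : finite time horizon
\pi(\mathrm{d}a \mid t, x) = \mathcal{N}\bigl(K_t x,\; V_t\bigr)(\mathrm{d}a),
\qquad t \in [0, T].
```

Under this reading, the mean parameter $K$ would be the one updated with the Fisher geometry and the covariance parameter $V$ with the Bures-Wasserstein geometry.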