We study the convergence of deterministic policy gradient under the Hadamard parameterization in the tabular setting and establish linear convergence of the algorithm. To this end, we first show that the error decreases at an $O(\frac{1}{k})$ rate over all iterations. Building on this result, we then show that the algorithm attains a faster, local linear convergence rate after $k_0$ iterations, where $k_0$ is a constant that depends only on the MDP and the initialization. To prove the local linear rate, we establish a contraction of the sub-optimal probability $b_s^k$ (i.e., the probability that the output policy $\pi^k$ assigns to non-optimal actions) for $k\ge k_0$.
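
To make the quantity $b_s^k$ concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the Hadamard parameterization takes the form $\pi(a|s)=\theta_{s,a}^2$ with each $\theta_s$ scaled to unit norm, and that the set of optimal actions in each state is already known. The function names `hadamard_policy` and `suboptimal_probability` and the toy numbers are hypothetical, chosen only for illustration.

```python
def hadamard_policy(theta):
    """Map per-state parameters to action probabilities via pi = theta * theta.

    Assumption: each theta_s is rescaled to unit Euclidean norm so that the
    squared entries sum to one and form a valid distribution.
    """
    policy = []
    for theta_s in theta:
        norm_sq = sum(t * t for t in theta_s)
        policy.append([t * t / norm_sq for t in theta_s])
    return policy


def suboptimal_probability(policy, optimal_actions):
    """Return b_s for every state: the mass the policy puts on non-optimal actions."""
    return [
        sum(p for a, p in enumerate(pi_s) if a not in optimal_actions[s])
        for s, pi_s in enumerate(policy)
    ]


# Toy example with two states and two actions (illustrative values only).
theta = [[0.6, 0.8], [1.0, 0.0]]
optimal_actions = [{1}, {0}]  # assumed known for this illustration

policy = hadamard_policy(theta)
b = suboptimal_probability(policy, optimal_actions)
print(b)
```

In this sketch, the local linear convergence described above would correspond to each entry of `b` shrinking by a constant factor per iteration once $k\ge k_0$; the snippet only evaluates $b_s$ for a fixed policy and does not run the policy gradient update itself.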