We study the problem of learning an $\epsilon$-approximate solution to the discrete-time Linear Quadratic Regulator (LQR) problem via a Stochastic Variance-Reduced Policy Gradient (SVRPG) approach. Although policy gradient methods have been proven to converge linearly to the optimal solution of the model-free LQR problem, the large number of two-point cost queries they require for gradient estimation may be prohibitive, particularly in applications where evaluating the cost function at two distinct control input configurations is exceptionally costly. To this end, we propose an oracle-efficient approach. Our method combines one-point and two-point gradient estimates in a dual-loop variance-reduced algorithm. It attains an approximately optimal solution using only $O\left(\log\left(1/\epsilon\right)^{\beta}\right)$ two-point cost queries, for $\beta \in (0,1)$.
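To make the distinction between one-point and two-point cost queries concrete, the following is a minimal sketch of the standard zeroth-order gradient estimators for a policy gain $K$. It is not the dual-loop algorithm of the paper: the estimator forms (uniform perturbations on a sphere of radius `radius`), the finite-horizon rollout, and all names (`lqr_cost`, `sample_sphere`, `one_point_grad`, `two_point_grad`) are assumptions chosen for illustration.

```python
import numpy as np


def lqr_cost(K, A, B, Q, R, x0s, horizon=50):
    """Finite-horizon cost of the policy u_t = -K x_t, averaged over x0s.

    Assumption: a truncated rollout stands in for the infinite-horizon cost.
    """
    total = 0.0
    for x0 in x0s:
        x = np.array(x0, dtype=float)
        for _ in range(horizon):
            u = -K @ x
            total += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u
    return total / len(x0s)


def sample_sphere(shape, radius, rng):
    """Draw a perturbation uniformly from the sphere of the given radius."""
    U = rng.standard_normal(shape)
    return radius * U / np.linalg.norm(U)


def one_point_grad(cost, K, radius, rng):
    """One-point estimate: a single cost query at K + U."""
    U = sample_sphere(K.shape, radius, rng)
    return (K.size / radius**2) * cost(K + U) * U


def two_point_grad(cost, K, radius, rng):
    """Two-point estimate: cost queries at K + U and K - U (same U)."""
    U = sample_sphere(K.shape, radius, rng)
    diff = cost(K + U) - cost(K - U)
    return (K.size / (2.0 * radius**2)) * diff * U


if __name__ == "__main__":
    # Hypothetical double-integrator-like system, for illustration only.
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    Q, R = np.eye(2), np.eye(1)
    x0s = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    K = np.array([[1.0, 1.5]])

    rng = np.random.default_rng(0)
    cost = lambda gain: lqr_cost(gain, A, B, Q, R, x0s)
    print("one-point:", one_point_grad(cost, K, 0.05, rng))
    print("two-point:", two_point_grad(cost, K, 0.05, rng))
```

The two-point estimate cancels the cost offset common to both queries, which is why it typically has much lower variance than the one-point estimate. The trade-off is that it needs two evaluations under the same perturbation, which is the query the abstract describes as costly.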