Zero-sum Linear Quadratic (LQ) games are fundamental in optimal control and can be used (i) as a dynamic game formulation for risk-sensitive or robust control, or (ii) as a benchmark setting for multi-agent reinforcement learning with two competing agents in continuous state-control spaces. In contrast to the well-studied single-agent linear quadratic regulator problem, zero-sum LQ games require solving a challenging nonconvex-nonconcave min-max problem whose objective function lacks coercivity. Recently, Zhang et al. discovered an implicit regularization property of natural policy gradient methods that is crucial for safety-critical control systems, since it preserves the robustness of the controller during learning. Moreover, in the model-free setting, where the model parameters are unknown, Zhang et al. proposed the first algorithm with polynomial sample complexity that reaches an $\epsilon$-neighborhood of the Nash equilibrium while maintaining the desirable implicit regularization property. In this work, we propose a simpler nested Zeroth-Order (ZO) algorithm that improves the sample complexity by several orders of magnitude. Our main result guarantees a $\widetilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity under the same assumptions using a single-point ZO estimator. Furthermore, when the single-point estimator is replaced by a two-point estimator, our method enjoys a better $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity. Our key improvements rely on a more sample-efficient nested algorithm design and finer control of the ZO natural gradient estimation error.
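To make the distinction between single-point and two-point ZO estimators concrete, the following is a minimal sketch of the generic sphere-smoothing versions of both estimators, applied to a toy quadratic. It is not the nested natural policy gradient algorithm of this work: the function names (`one_point_grad`, `two_point_grad`, `sample_unit_sphere`), the toy objective, and the smoothing radius `r` are all assumptions chosen for illustration.

```python
import numpy as np


def sample_unit_sphere(rng, n, d):
    """Draw n directions uniformly from the unit sphere in R^d."""
    v = rng.standard_normal((n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def one_point_grad(f, x, r, n, rng):
    """Single-point ZO estimate: average of (d / r) * f(x + r u) * u.

    Uses one function evaluation per sampled direction u.
    """
    d = x.size
    u = sample_unit_sphere(rng, n, d)
    vals = np.array([f(x + r * ui) for ui in u])
    return (d / r) * (vals[:, None] * u).mean(axis=0)


def two_point_grad(f, x, r, n, rng):
    """Two-point ZO estimate: average of (d / 2r) * (f(x + r u) - f(x - r u)) * u.

    Uses two function evaluations per sampled direction u.
    """
    d = x.size
    u = sample_unit_sphere(rng, n, d)
    diffs = np.array([f(x + r * ui) - f(x - r * ui) for ui in u])
    return (d / (2.0 * r)) * (diffs[:, None] * u).mean(axis=0)


# Toy objective (an assumption for illustration): f(x) = 0.5 * x^T Q x,
# whose exact gradient is Q x.
Q = np.diag([1.0, 2.0, 3.0])
x0 = np.array([1.0, -1.0, 0.5])


def f(x):
    return 0.5 * x @ Q @ x


true_grad = Q @ x0
rng = np.random.default_rng(0)
g1 = one_point_grad(f, x0, r=0.5, n=20000, rng=rng)
g2 = two_point_grad(f, x0, r=0.5, n=20000, rng=rng)
print("one-point error:", np.linalg.norm(g1 - true_grad))
print("two-point error:", np.linalg.norm(g2 - true_grad))
```

In this generic form, the single-point estimate scales the raw function value by `d / r`, so its variance grows as `r` shrinks. The two-point estimate uses a difference of function values, which stays of order `r` for smooth objectives and so keeps the variance bounded. This is the usual intuition behind the better sample complexity of two-point estimators, although the analysis in this work concerns natural gradient estimates for the LQ game rather than this toy setting.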