We consider the problem of online adaptive control of the linear quadratic regulator, where the true system parameters are unknown. We prove new upper and lower bounds demonstrating that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state. Notably, our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been conjectured due to the apparent strong convexity of the problem. Our upper bound is attained by a simple variant of $\textit{certainty equivalent control}$, where the learner selects control inputs according to the optimal controller for their estimate of the system while injecting exploratory random noise. While this approach was shown to achieve $\sqrt{T}$-regret by Mania et al. (2019), we show that if the learner continually refines their estimates of the system matrices, the method attains optimal dimension dependence as well. Central to our upper and lower bounds is a new approach for controlling perturbations of Riccati equations called the $\textit{self-bounding ODE method}$, which we use to derive suboptimality bounds for the certainty equivalent controller synthesized from estimated system dynamics. This in turn enables regret upper bounds which hold for $\textit{any stabilizable instance}$ and scale with natural control-theoretic quantities.
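To make the certainty equivalent scheme described in the abstract concrete, the following is a minimal sketch for a scalar system ($d_{\mathbf{x}} = d_{\mathbf{u}} = 1$), not the algorithm analyzed in the paper. Several ingredients are assumptions introduced only for illustration: the function names, unit costs $q = r = 1$, an open-loop stable system so that a zero initial gain is safe, a doubling schedule for re-estimation, and exploration noise whose standard deviation decays as $t^{-1/4}$. The learner fits $(a, b)$ by least squares, solves the Riccati equation for the estimate, and plays the resulting gain plus exploratory noise.

```python
import random


def solve_scalar_riccati(a, b, q=1.0, r=1.0, iters=500):
    """Iterate the scalar discrete-time Riccati recursion toward its fixed point."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return p


def ce_gain(a, b, q=1.0, r=1.0):
    """LQR gain k (control law u = k * x) that is optimal for the model (a, b)."""
    p = solve_scalar_riccati(a, b, q, r)
    return -(a * b * p) / (r + b * b * p)


def least_squares_ab(xs, us, xs_next):
    """Least squares fit of x_next ~ a * x + b * u via the 2x2 normal equations."""
    sxx = sum(x * x for x in xs)
    sxu = sum(x * u for x, u in zip(xs, us))
    suu = sum(u * u for u in us)
    sxy = sum(x * y for x, y in zip(xs, xs_next))
    suy = sum(u * y for u, y in zip(us, xs_next))
    det = sxx * suu - sxu * sxu
    a_hat = (sxy * suu - suy * sxu) / det
    b_hat = (suy * sxx - sxy * sxu) / det
    return a_hat, b_hat


def run_ce_control(a_true, b_true, horizon=4096, first_update=64, seed=0):
    """Certainty equivalent control with exploration noise (illustrative sketch).

    Assumes |a_true| < 1 so that the initial gain k = 0 keeps the state bounded.
    """
    rng = random.Random(seed)
    xs, us, xs_next = [], [], []
    x, k = 0.0, 0.0
    a_hat, b_hat = 0.0, 0.0
    next_update = first_update
    for t in range(1, horizon + 1):
        sigma = t ** -0.25  # assumed schedule: exploration variance ~ 1 / sqrt(t)
        u = k * x + sigma * rng.gauss(0.0, 1.0)
        x_next = a_true * x + b_true * u + rng.gauss(0.0, 1.0)
        xs.append(x)
        us.append(u)
        xs_next.append(x_next)
        x = x_next
        if t == next_update:
            # Refine the estimate and re-synthesize the controller from it.
            a_hat, b_hat = least_squares_ab(xs, us, xs_next)
            k = ce_gain(a_hat, b_hat)
            next_update *= 2
    return a_hat, b_hat, k


if __name__ == "__main__":
    a_hat, b_hat, k = run_ce_control(a_true=0.8, b_true=1.0)
    print("estimates:", a_hat, b_hat, "gain:", k)
```

In this sketch the controller is re-synthesized each time the amount of data doubles, which is one simple way to realize "continually refining" the estimates. The matrix case would replace the scalar recursion with a discrete algebraic Riccati solver and the 2x2 solve with a general least squares routine.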