The first algorithm for the Linear Quadratic (LQ) control problem with an unknown system model to achieve a regret of $\mathcal{O}(\sqrt{T})$ was introduced by Abbasi-Yadkori and Szepesv\'ari (2011). Recognizing the computational complexity of this algorithm, subsequent efforts (see Cohen et al. (2019), Mania et al. (2019), Faradonbeh et al. (2020a), and Kargin et al. (2022)) have been dedicated to proposing algorithms that are computationally tractable while preserving this order of regret. Although successful, existing works in the literature lack a fully adaptive exploration-exploitation trade-off adjustment and require user-defined quantities, which can inflate the overall regret bound by additional factors. Noticing this gap, we propose the first fully adaptive algorithm that controls the number of policy updates (i.e., tunes the exploration-exploitation trade-off) and adaptively optimizes the upper bound on the regret. Our proposed algorithm builds on the SDP-based approach of Cohen et al. (2019) and removes its need for a horizon-dependent warm-up phase by appropriately tuning the regularization parameter and adding an adaptive input perturbation. We further show that, through careful adjustment of the exploration-exploitation trade-off, there is no need to commit to the widely used notion of strong sequential stability, which is restrictive and can introduce complexities in initialization.
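For concreteness, the LQ setting referenced above is typically formulated as follows (a standard sketch; the symbols $A_*$, $B_*$, $Q$, $R$, $w_t$, and $J_*$ denote the usual unknown dynamics, known cost matrices, process noise, and optimal average cost, and are introduced here for illustration rather than taken from this text):
\[
x_{t+1} = A_* x_t + B_* u_t + w_t, \qquad c_t = x_t^\top Q\, x_t + u_t^\top R\, u_t,
\]
and the regret of a learning algorithm over horizon $T$ compares its cumulative cost to that of the optimal policy for the true system:
\[
\mathrm{Regret}(T) \;=\; \sum_{t=0}^{T-1} c_t \;-\; T\, J_*(A_*, B_*),
\]
so the $\mathcal{O}(\sqrt{T})$ guarantees above bound this quantity, typically with high probability.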