We propose a computationally efficient algorithm that achieves anytime regret of order $\mathcal{O}(\sqrt{t})$, with explicit dependence on the system dimensions and on the solution of the Discrete Algebraic Riccati Equation (DARE). Our approach builds on the SDP-based framework of \cite{cohen2019learning}, using an appropriately tuned regularization and a sufficiently accurate initial estimate to construct confidence ellipsoids for control design. A carefully designed input-perturbation mechanism ensures anytime performance. We develop two variants of the algorithm. The first enforces a notion of strong sequential stability, requiring each policy to be stabilizing and successive policies to remain close; enforcing this notion, however, results in suboptimal regret scaling. The second relaxes the sequential-stability requirement, asking only that each generated policy be stabilizing; closed-loop stability is then preserved through a dwell-time-inspired policy-update rule that adapts ideas from switched-systems control to balance exploration and exploitation. This class of algorithms also addresses a key shortcoming of most existing approaches, including certainty-equivalence-based methods, which typically guarantee stability only in the Lyapunov sense and lack explicit uniform high-probability bounds on the state trajectory expressed in system-theoretic terms. Our analysis explicitly characterizes the trade-off between state amplification and regret, and shows that partially relaxing the sequential-stability requirement yields optimal regret. Finally, our method eliminates the need for an a priori bound on the norm of the DARE solution, an assumption required by all existing computationally efficient optimism-in-the-face-of-uncertainty (OFU) based algorithms, and thereby removes the dependence of regret guarantees on such external inputs.
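For reference, the standard LQR quantities this paragraph invokes are as follows (the cost matrices $Q, R$ and the noise model are our assumptions here, standard in this literature but not spelled out above):
\begin{align*}
x_{t+1} &= A x_t + B u_t + w_t, \qquad c_t = x_t^\top Q\, x_t + u_t^\top R\, u_t,\\
P &= A^\top P A - A^\top P B \left(R + B^\top P B\right)^{-1} B^\top P A + Q \quad \text{(DARE)},\\
\mathrm{Regret}(t) &= \sum_{s=1}^{t} c_s - t\, J^\star, \qquad J^\star = \mathrm{tr}(P\,W),
\end{align*}
where $W$ is the noise covariance and $J^\star$ the optimal steady-state cost; the anytime guarantee bounds $\mathrm{Regret}(t)$ by $\mathcal{O}(\sqrt{t})$ simultaneously for all $t$, rather than only at a fixed horizon.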
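To make the dwell-time-inspired update rule concrete, the following is a minimal illustrative sketch in Python/NumPy, not the algorithm itself: the names \texttt{DwellTimeController}, \texttt{dwell}, and \texttt{sigma\_explore} are hypothetical, the stabilizability flag stands in for the actual confidence-ellipsoid certificate, and the exploration-decay schedule is a placeholder that the analysis would fix.
\begin{verbatim}
import numpy as np

class DwellTimeController:
    """Dwell-time-inspired switching: a newly computed gain is adopted
    only if the current gain has been held for at least `dwell` steps,
    so rapid switching between stabilizing policies cannot destabilize
    the closed loop."""

    def __init__(self, K0, dwell=50, sigma_explore=0.1):
        self.K = K0                  # current stabilizing gain, u = -K x
        self.dwell = dwell           # minimum steps between switches
        self.sigma = sigma_explore   # input-perturbation scale
        self.since_switch = 0

    def maybe_switch(self, K_new, is_stabilizing):
        """Adopt K_new only if it is certified stabilizing and the
        dwell-time condition is satisfied."""
        self.since_switch += 1
        if is_stabilizing and self.since_switch >= self.dwell:
            self.K = K_new
            self.since_switch = 0

    def action(self, x, t):
        # Decaying input perturbation (illustrative t^{-1/4} schedule)
        noise = self.sigma * (t + 1) ** (-0.25) \
            * np.random.randn(self.K.shape[0])
        return -self.K @ x + noise

# Usage sketch: K0 from the initial estimate, K_new from the SDP step.
ctrl = DwellTimeController(K0=np.zeros((1, 2)), dwell=50)
u = ctrl.action(x=np.zeros(2), t=0)
\end{verbatim}
The dwell length trades off how quickly improved estimates are exploited against how much transient state amplification each switch can cause, which is the exploration--exploitation balance described above.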