Many practical applications of online reinforcement learning require the satisfaction of safety constraints while learning about the unknown environment. In this work, we study Linear Quadratic Regulator (LQR) learning with unknown dynamics, under the additional constraint that the position must stay within a safe region for the entire trajectory with high probability. Unlike previous works, we allow for both bounded and unbounded noise distributions and study stronger baselines of nonlinear controllers, which are better suited to constrained problems than linear controllers. Due to these complications, we focus on 1-dimensional state and action spaces; however, we also discuss how we expect the high-level takeaways to generalize to higher dimensions. Our primary contribution is the first $\tilde{O}_T(\sqrt{T})$-regret bound for constrained LQR learning, which we show relative to a specific baseline of nonlinear controllers. We then prove that, for any nonlinear baseline satisfying natural assumptions, $\tilde{O}_T(\sqrt{T})$-regret is possible when the noise distribution has sufficiently large support, and $\tilde{O}_T(T^{2/3})$-regret is possible for any subgaussian noise distribution. An overarching theme of our results is that enforcing safety provides "free exploration" that compensates for the added cost of uncertainty in safety-constrained control, resulting in the same regret rate as in the unconstrained problem.
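For concreteness, the 1-dimensional constrained setting described above can be sketched as follows. The symbols $a^*, b^*, q, r, \bar{x}, \pi^{\mathrm{base}}$ are illustrative placeholders for the unknown dynamics, cost weights, safety boundary, and baseline controller, and are not necessarily the paper's exact notation:

```latex
% Illustrative formalization of 1-D safety-constrained LQR learning
% (assumed notation, not the paper's own)
\begin{align*}
  \text{dynamics:} \quad & x_{t+1} = a^{*} x_t + b^{*} u_t + w_t,
      \qquad a^{*}, b^{*} \ \text{unknown}, \\
  \text{cost:} \quad & \sum_{t=0}^{T-1} \left( q\, x_t^{2} + r\, u_t^{2} \right),
      \qquad q, r > 0, \\
  \text{safety:} \quad & \Pr\!\left[\, |x_t| \le \bar{x} \ \ \forall\, t \le T \,\right]
      \ge 1 - \delta, \\
  \text{regret:} \quad & R_T \;=\; \sum_{t=0}^{T-1} c_t(\text{algorithm})
      \;-\; \sum_{t=0}^{T-1} c_t\!\left(\pi^{\mathrm{base}}\right),
\end{align*}
```

where $w_t$ is the (bounded or subgaussian) noise and $\pi^{\mathrm{base}}$ ranges over the nonlinear baseline class; the results above bound $R_T$ by $\tilde{O}_T(\sqrt{T})$ or $\tilde{O}_T(T^{2/3})$ depending on the noise assumptions.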