The exploration/exploitation trade-off is an inherent challenge in data-driven adaptive control. Although this trade-off has been studied for multi-armed bandits (MABs) and reinforcement learning for linear systems, it is less well studied for learning-based control of nonlinear systems. A significant theoretical challenge in the nonlinear setting is that there is no explicit characterization of an optimal controller for a given set of cost and system parameters. We propose the use of a finite-horizon oracle controller with full knowledge of the parameters as a reasonable surrogate for the optimal controller. This allows us to develop policies in the context of learning-based model predictive control (MPC) and MABs, and to conduct a control-theoretic analysis using techniques from MPC and optimization theory to show that these policies achieve low regret with respect to this finite-horizon oracle. Our simulations exhibit the low regret of our policy on a heating, ventilation, and air-conditioning model with a partially unknown cost function.