In modern machine learning, models can often fit training data in numerous ways, some of which perform well on unseen (test) data, while others do not. Remarkably, in such cases gradient descent frequently exhibits an implicit bias that leads to excellent performance on unseen data. This implicit bias has been extensively studied in supervised learning, but is far less understood in optimal control (reinforcement learning). There, learning a controller applied to a system via gradient descent is known as policy gradient, and a question of prime importance is the extent to which a learned controller extrapolates to unseen initial states. This paper theoretically studies the implicit bias of policy gradient in terms of extrapolation to unseen initial states. Focusing on the fundamental Linear Quadratic Regulator (LQR) problem, we establish that the extent of extrapolation depends on the degree of exploration induced by the system when commencing from initial states included in training. Experiments corroborate our theory, and demonstrate its conclusions on problems beyond LQR, where systems are non-linear and controllers are neural networks. We hypothesize that real-world optimal control may be greatly improved by developing methods for informed selection of initial states to train on.
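To make the setting concrete, the following is a minimal sketch of policy gradient on a small finite-horizon LQR problem, evaluated on initial states that were not used in training. Everything specific in it is an illustrative assumption rather than the paper's experimental setup: the dimension, the horizon `H`, the matrices `A`, `B`, `Q`, `R`, the zero initialization of the controller `K`, the step size, and the helper names `rollout_cost_and_grad`, `policy_gradient`, and `mean_cost`. The two systems are chosen to contrast a case where the trajectory from the single training state visits every coordinate direction (a cyclic shift) with a case where it never leaves the training direction (the identity).

```python
import numpy as np


def rollout_cost_and_grad(K, A, B, Q, R, x0, H):
    """Cost of the linear controller u_t = K x_t over steps 0..H, and its gradient in K.

    Assumes Q and R are symmetric. Dynamics: x_{t+1} = A x_t + B u_t.
    """
    M = A + B @ K                  # closed-loop dynamics matrix
    S = Q + K.T @ R @ K            # per-step cost expressed in the state alone
    xs = [x0]
    for _ in range(H):
        xs.append(M @ xs[-1])
    cost = sum(x @ S @ x for x in xs)

    # Backward (adjoint) pass: lam holds d(cost of steps t+1..H) / d x_{t+1}.
    grad = np.zeros_like(K)
    lam = np.zeros_like(x0)
    for x in reversed(xs):
        grad += 2.0 * R @ K @ np.outer(x, x) + B.T @ np.outer(lam, x)
        lam = 2.0 * S @ x + M.T @ lam
    return cost, grad


def policy_gradient(A, B, Q, R, train_states, H, lr=0.005, steps=2000):
    """Plain gradient descent on the mean training cost, starting from K = 0 (an assumption)."""
    K = np.zeros((B.shape[1], A.shape[0]))
    for _ in range(steps):
        grads = [rollout_cost_and_grad(K, A, B, Q, R, x0, H)[1] for x0 in train_states]
        K -= lr * np.mean(grads, axis=0)
    return K


def mean_cost(K, A, B, Q, R, states, H):
    return float(np.mean([rollout_cost_and_grad(K, A, B, Q, R, x0, H)[0] for x0 in states]))


# Illustrative problem: 3 states, train on e1 only, evaluate on e2 and e3.
n, H = 3, 4
B, Q, R = np.eye(n), np.eye(n), 0.1 * np.eye(n)
train = [np.eye(n)[0]]
unseen = [np.eye(n)[1], np.eye(n)[2]]
systems = {
    "shift": np.roll(np.eye(n), 1, axis=0),   # e1 -> e2 -> e3 -> e1: trajectory explores
    "identity": np.eye(n),                    # e1 -> e1: trajectory does not explore
}

results = {}
K_init = np.zeros((n, n))
for name, A in systems.items():
    K = policy_gradient(A, B, Q, R, train, H)
    results[name] = {
        "train_before": mean_cost(K_init, A, B, Q, R, train, H),
        "train_after": mean_cost(K, A, B, Q, R, train, H),
        "unseen_before": mean_cost(K_init, A, B, Q, R, unseen, H),
        "unseen_after": mean_cost(K, A, B, Q, R, unseen, H),
    }
    print(name, results[name])
```

In this toy setup the gradient only ever changes how `K` acts on directions that the training trajectories actually visit. With the identity system the trajectory from `e1` stays on `e1`, so the learned controller is unchanged on `e2` and `e3`; with the shift system the trajectory passes through all three directions, so the cost on the unseen states can also change. This is a sketch of the phenomenon under the stated assumptions, not a reproduction of the paper's results.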