In model-based reinforcement learning, most algorithms rely on simulating trajectories from one-step dynamics models learned from data. A critical challenge of this approach is that one-step prediction errors compound as the trajectory grows longer. In this paper we tackle this issue by using a multi-step objective to train one-step models. Our objective is a weighted sum of the mean squared error (MSE) losses at various future horizons. We find that this new loss is particularly useful when the data is noisy (additive Gaussian noise in the observations), which is often the case in real-life environments. To support the multi-step loss, we first study its properties in two tractable cases: i) a one-dimensional linear system, and ii) a two-parameter non-linear system. Second, we show on a variety of tasks (environments or datasets) that models learned with this loss achieve a significant improvement in the R² score averaged over future prediction horizons. Finally, in the pure batch reinforcement learning setting, we demonstrate that one-step models serve as strong baselines when the dynamics are deterministic, while multi-step models are more advantageous in the presence of noise, highlighting the potential of our approach in real-world applications.
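As an illustration of the objective described above, the following is a minimal sketch of a weighted multi-step MSE for a one-step model, evaluated on the one-dimensional linear case. The function name `multi_step_mse`, the weight values, and the choice to average over all valid starting points are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def multi_step_mse(step_fn, traj, weights):
    """Weighted sum of MSEs between h-step rollouts and observed states.

    step_fn : one-step model mapping a batch of states to next states
    traj    : array of shape (T, d), observed states of one trajectory
    weights : sequence of length H; weights[h-1] weights horizon h
    (All names and the averaging scheme are illustrative assumptions.)
    """
    traj = np.asarray(traj, dtype=float)
    H = len(weights)
    n_starts = traj.shape[0] - H  # starting states that have H future targets
    if n_starts <= 0:
        raise ValueError("trajectory too short for the requested horizons")
    pred = traj[:n_starts]
    loss = 0.0
    for h, w in enumerate(weights, start=1):
        pred = step_fn(pred)             # feed predictions back into the model
        target = traj[h:h + n_starts]    # observed states h steps ahead
        loss += w * np.mean((pred - target) ** 2)
    return loss

# Toy one-dimensional linear system x_{t+1} = a * x_t, without noise.
a_true = 0.9
traj = (a_true ** np.arange(20)).reshape(-1, 1)
weights = [0.5, 0.3, 0.2]  # hypothetical horizon weights

exact = multi_step_mse(lambda x: a_true * x, traj, weights)
biased = multi_step_mse(lambda x: 0.8 * x, traj, weights)
```

With a single weight, `weights=[1.0]`, the function reduces to the usual one-step MSE, so the one-step baseline is a special case of this sketch.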