Model-based reinforcement learning uses models to plan, so an agent can improve its predictions and policies by spending more computation rather than collecting additional data from the environment, which improves sample efficiency. However, learning an accurate model is hard. A natural question, then, is whether model-free methods can obtain benefits similar to those of planning. Experience replay is an essential component of many model-free algorithms: it enables sample-efficient learning and stability by storing past experiences for reuse in gradient computation. Prior works have established connections between models and experience replay by planning with the latter. This involves increasing the number of times a mini-batch is sampled and used for updates at each step (the amount of replay per step). We exploit this connection through a systematic study of the effect of varying the amount of replay per step in a well-known model-free algorithm, Deep Q-Network (DQN), in the Mountain Car environment. We show empirically that increasing replay improves DQN's sample efficiency, reduces the variation in its performance, and makes it more robust to changes in hyperparameters. Altogether, this takes a step toward a better algorithm for deployment.
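To make "amount of replay per step" concrete, the following is a minimal sketch, not the setup studied here. It assumes tabular Q-learning on a hypothetical toy chain environment in place of DQN on Mountain Car. The names `step_chain`, `train`, and `replay_per_step`, and all hyperparameter values, are illustrative assumptions. The only point is the inner loop: each environment step is followed by `replay_per_step` mini-batch updates drawn from the buffer.

```python
import random


def step_chain(state, action, n_states=6):
    """Hypothetical toy chain: action 1 moves right, action 0 moves left.

    Reaching the right end gives reward 1 and ends the episode.
    """
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = nxt == n_states - 1
    return nxt, (1.0 if done else 0.0), done


def train(replay_per_step, env_steps=300, batch_size=8,
          alpha=0.5, gamma=0.9, eps=0.2, n_states=6, seed=0):
    """Tabular Q-learning with experience replay (a stand-in for DQN)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    buffer = []   # stored transitions (s, a, r, s', done)
    updates = 0   # number of mini-batch updates performed
    state = 0
    for _ in range(env_steps):
        # Epsilon-greedy action selection, breaking ties at random.
        if rng.random() < eps or q[state][0] == q[state][1]:
            action = rng.randrange(2)
        else:
            action = 0 if q[state][0] > q[state][1] else 1

        # One real interaction with the environment per step.
        nxt, reward, done = step_chain(state, action, n_states)
        buffer.append((state, action, reward, nxt, done))
        state = 0 if done else nxt

        # Replay: reuse stored experience replay_per_step times.
        for _ in range(replay_per_step):
            batch = rng.choices(buffer, k=batch_size)  # with replacement
            for s, a, r, s2, d in batch:
                target = r if d else r + gamma * max(q[s2])
                q[s][a] += alpha * (target - q[s][a])
            updates += 1
    return q, updates
```

Calling `train(1)` and `train(4)` uses the same 300 environment steps in both cases, but the second call performs four times as many updates. This is the sense in which more replay trades extra computation for the same amount of data.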