We consider offline reinforcement learning, where only a set of system transitions is available for policy optimization. Following recent advances in the field, we study a model-based reinforcement learning algorithm that infers the system dynamics from the available data and optimizes the policy on imagined model rollouts. This approach is prone to exploiting model errors, which can lead to catastrophic failures on the real system. The standard remedy is to rely on ensembles for uncertainty heuristics and to avoid exploiting the model where it is too uncertain. We challenge the popular belief that ensembles are necessary by showing that, on the D4RL benchmark, better performance can be obtained with a single well-calibrated autoregressive model. We also analyze static metrics of model learning and identify the model properties that matter most for the final performance of the agent.