Offline reinforcement learning (RL) allows sequential behavior to be learned from fixed datasets. Because offline datasets do not cover all possible situations, many methods collect additional data during online fine-tuning to improve performance. These methods generally assume that the transition dynamics remain the same across the offline and online phases of training. However, in many real-world applications, such as outdoor construction and navigation over rough terrain, the transition dynamics commonly differ between the offline and online phases. Moreover, the dynamics may change during online fine-tuning itself. To address this problem of changing dynamics from offline to online RL, we propose a residual learning approach that infers changes in the dynamics and uses them to correct the outputs of the offline solution. During online fine-tuning, we train a context encoder to learn a representation that is consistent within the current online learning environment while still being able to predict the transition dynamics. Experiments in D4RL MuJoCo environments, modified so that the dynamics can change upon environment resets, show that our approach adapts to these dynamics changes and generalizes to unseen perturbations in a sample-efficient way, whereas the comparison methods do not.
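
To make the residual idea concrete, the following is a minimal sketch rather than the implementation evaluated in the experiments. It assumes a frozen offline policy, a context encoder that summarizes recent transitions into a vector `z`, and a residual head whose output is added to the offline action. All names, dimensions, and the linear/tanh parameterizations (`offline_policy`, `encode_context`, `corrected_action`, `W_res`) are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, CONTEXT_DIM = 4, 2, 3  # illustrative sizes (assumption)

# Hypothetical frozen offline policy: a fixed linear map squashed by tanh.
W_offline = rng.normal(size=(ACTION_DIM, STATE_DIM))

def offline_policy(state):
    """Action proposed by the offline solution; never updated online."""
    return np.tanh(W_offline @ state)

# Hypothetical context encoder: embeds each (s, a, s') transition and
# averages the embeddings, giving one context vector per environment.
W_enc = rng.normal(size=(CONTEXT_DIM, 2 * STATE_DIM + ACTION_DIM))

def encode_context(transitions):
    feats = np.stack([np.concatenate([s, a, s_next])
                      for s, a, s_next in transitions])
    return np.tanh(feats @ W_enc.T).mean(axis=0)

# Residual head, initialized to zero so that the corrected policy
# starts out identical to the offline policy.
W_res = np.zeros((ACTION_DIM, STATE_DIM + CONTEXT_DIM))

def corrected_action(state, transitions, w_res=W_res):
    """Offline action plus a context-conditioned residual, clipped to [-1, 1]."""
    z = encode_context(transitions)
    residual = w_res @ np.concatenate([state, z])
    return np.clip(offline_policy(state) + residual, -1.0, 1.0)
```

In this sketch only `W_enc` and `W_res` would be trained during online fine-tuning. Initializing the residual head at zero is a design choice of the sketch: the agent begins with the offline behavior and departs from it only as the inferred context indicates a change in dynamics.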