Environment Transformer and Policy Optimization for Model-Based Offline Reinforcement Learning

Interacting with the actual environment to acquire data is often costly and time-consuming in robotic tasks. Model-based offline reinforcement learning (RL) provides a feasible solution. On the one hand, it removes the need to interact with the actual environment. On the other hand, it learns the transition dynamics and reward function from offline datasets and generates simulated rollouts to accelerate training. Previous model-based offline RL methods adopt probabilistic ensemble neural networks (NN) to model aleatoric and epistemic uncertainty. However, this results in an exponential increase in training time and computing resource requirements. Furthermore, these methods are easily disturbed by the accumulated errors of the environment dynamics models when simulating long-term rollouts. To solve these problems, we propose an uncertainty-aware sequence modeling architecture called Environment Transformer. It models the probability distribution of the environment dynamics and reward function to capture aleatoric uncertainty, and it treats epistemic uncertainty as a learnable noise parameter. Because it models the transition dynamics and reward function accurately, Environment Transformer can be combined with arbitrary planning, dynamic programming, or policy optimization algorithms for offline RL. In this work, we use Conservative Q-Learning (CQL) to learn a conservative Q-function. Through simulation experiments, we demonstrate that our method matches or exceeds state-of-the-art performance on widely studied offline RL benchmarks. Moreover, we show that Environment Transformer's simulated rollout quality, sample efficiency, and long-term rollout simulation capability are superior to those of previous model-based offline RL methods.
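As a rough illustration of two ingredients named in the abstract, the sketch below shows (i) a Gaussian negative log-likelihood, a common training objective when a model predicts a distribution over the next state and reward in order to capture aleatoric uncertainty, and (ii) the CQL regularizer in its discrete-action form. This is not the authors' implementation: the function names `gaussian_nll` and `cql_penalty` are hypothetical, the diagonal Gaussian is an assumption, the discrete-action setting is a simplification for illustration, and the Transformer itself and the learnable epistemic noise parameter are not shown.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under a diagonal Gaussian.

    `mean` and `log_var` stand in for the outputs of a (hypothetical)
    dynamics/reward model; `target` is the observed next state and reward.
    The predicted variance is what represents aleatoric uncertainty.
    """
    nll = 0.0
    for x, mu, lv in zip(target, mean, log_var):
        # Per-dimension term: 0.5 * (log(2*pi) + log sigma^2 + (x - mu)^2 / sigma^2)
        nll += 0.5 * (math.log(2.0 * math.pi) + lv + (x - mu) ** 2 / math.exp(lv))
    return nll


def cql_penalty(q_values, data_action):
    """Discrete-action CQL regularizer: logsumexp_a Q(s, a) - Q(s, a_data).

    The penalty pushes down Q-values over all actions while pushing up the
    Q-value of the action actually present in the offline dataset.
    """
    m = max(q_values)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return log_sum_exp - q_values[data_action]


if __name__ == "__main__":
    # Toy numbers only; they do not come from any experiment.
    print(gaussian_nll(target=[1.0, 0.5], mean=[0.9, 0.4], log_var=[-2.0, -2.0]))
    print(cql_penalty(q_values=[1.0, 2.0, 0.5], data_action=1))
```

In a full pipeline, the first term would be minimized while fitting the environment model, and the second would be added to the usual Bellman error when training the Q-function on real and simulated transitions.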
