In this paper, we train policies for multi-objective reinforcement learning (MORL) using only offline trajectory data. To this end, we extend offline policy-regularized methods, a widely adopted approach for single-objective offline RL, to the multi-objective setting. However, such methods face a new challenge in offline MORL: the preference-inconsistent demonstration problem. We propose two solutions to this problem: 1) filtering out preference-inconsistent demonstrations by approximating behavior preferences, and 2) adopting regularization techniques with high policy expressiveness. Moreover, we integrate the preference-conditioned scalarized update method into policy-regularized offline RL so that a single policy network learns a set of policies simultaneously, which avoids the computational cost of training a large number of individual policies for different preferences. Finally, we introduce Regularization Weight Adaptation, which dynamically determines appropriate regularization weights for arbitrary target preferences during deployment. Empirical results on various multi-objective datasets demonstrate the capability of our approach in solving offline MORL problems.
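To make the first solution and the scalarized update more concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes that a trajectory's behavior preference can be approximated by L1-normalizing its vector return, that consistency is measured by L1 distance to the target preference, and that scalarization is linear. The function names (`approx_behavior_preference`, `filter_consistent`, `scalarize`) and the threshold `max_l1_dist` are hypothetical and introduced only for illustration.

```python
import numpy as np


def approx_behavior_preference(vector_return):
    """Assumed proxy: L1-normalize a trajectory's (clipped, nonnegative) vector return."""
    r = np.clip(np.asarray(vector_return, dtype=float), 0.0, None)
    total = r.sum()
    if total == 0.0:
        # No signal in any objective: fall back to a uniform preference.
        return np.full(r.shape, 1.0 / r.size)
    return r / total


def filter_consistent(vector_returns, target_pref, max_l1_dist=0.4):
    """Return indices of trajectories whose approximated preference lies
    within max_l1_dist (hypothetical threshold) of the target preference."""
    target = np.asarray(target_pref, dtype=float)
    keep = []
    for i, ret in enumerate(vector_returns):
        dist = np.abs(approx_behavior_preference(ret) - target).sum()
        if dist <= max_l1_dist:
            keep.append(i)
    return keep


def scalarize(vector_q, pref):
    """Linear scalarization: preference-weighted sum of per-objective values."""
    return float(np.dot(np.asarray(pref, dtype=float),
                        np.asarray(vector_q, dtype=float)))


# Three trajectories with two objectives each (made-up returns).
returns = [[9.0, 1.0], [5.0, 5.0], [1.0, 9.0]]
kept = filter_consistent(returns, target_pref=[0.8, 0.2])
value = scalarize([2.0, 4.0], [0.25, 0.75])
```

Under these assumptions, only trajectories whose return profile roughly matches the target preference would be used for policy regularization, while the scalarized value is what a preference-conditioned policy would be trained to maximize.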