The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives. In practice, an agent's preferences over the objectives may not be known a priori, so we require policies that can generalize to arbitrary preferences at test time. In this work, we propose a new data-driven setup for offline MORL, in which we wish to learn a preference-agnostic policy agent using only a finite dataset of offline demonstrations of other agents and their preferences. The key contributions of this work are two-fold. First, we introduce D4MORL, (D)atasets for MORL that are specifically designed for offline settings. D4MORL contains 1.8 million annotated demonstrations obtained by rolling out reference policies that optimize for randomly sampled preferences on 6 MuJoCo environments with 2-3 objectives each. Second, we propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that builds on and extends Decision Transformers via a novel preference-and-return-conditioned policy. Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and, with appropriate conditioning, provides an excellent approximation of the Pareto front, as measured by the hypervolume and sparsity metrics.
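To make the two evaluation metrics named above concrete, the following is a minimal sketch for the two-objective case. It is not the evaluation code used for D4MORL or PEDA. It assumes that both objectives are maximized, that hypervolume is measured against a user-chosen reference point, and that sparsity follows one common definition from the MORL literature (the sum over objectives of squared gaps between neighboring Pareto-front points, divided by the number of points minus one). The function names `pareto_front`, `hypervolume_2d`, and `sparsity` are illustrative and do not come from the source.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (return on objective 1, return on objective 2)


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Return the non-dominated points (maximization), sorted by objective 1."""
    front = []
    for p in points:
        # p is dominated if a different point is at least as good on both objectives.
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points: Sequence[Point], ref: Point) -> float:
    """Area dominated by the front and bounded below by the reference point."""
    # Points that do not strictly exceed the reference point contribute no area.
    front = [p for p in pareto_front(points) if p[0] > ref[0] and p[1] > ref[1]]
    area = 0.0
    prev_y = ref[1]
    # Sweep from the largest objective-1 value down; objective 2 rises along the way,
    # so each point adds a horizontal strip of height (y - prev_y).
    for x, y in sorted(front, reverse=True):
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


def sparsity(points: Sequence[Point]) -> float:
    """Mean squared gap between neighboring front points, summed over objectives."""
    front = pareto_front(points)
    if len(front) < 2:
        return 0.0
    total = 0.0
    for j in range(2):
        values = sorted(p[j] for p in front)
        total += sum((b - a) ** 2 for a, b in zip(values, values[1:]))
    return total / (len(front) - 1)


if __name__ == "__main__":
    # Hypothetical returns of four policies; (1, 1) is dominated and is ignored.
    returns = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.0, 1.0)]
    print(pareto_front(returns))
    print(hypervolume_2d(returns, ref=(0.0, 0.0)))
    print(sparsity(returns))
```

Under these assumptions, a larger hypervolume indicates a front that dominates more of the objective space, while a lower sparsity indicates a more evenly and densely covered front. The two are usually reported together, because a single extreme point can yield a sizable hypervolume while covering the trade-off poorly.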