Multi-objective reinforcement learning (MORL) seeks to learn policies that balance multiple, often conflicting objectives. Although a single preference-conditioned policy is the most flexible and scalable solution, existing approaches remain brittle in practice, frequently failing to recover complete Pareto fronts. We show that this failure stems from two structural issues in current methods: destructive gradient interference caused by premature scalarization and representational collapse across the preference space. We introduce $D^3PO$, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. $D^3PO$ preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization, enabling reliable credit assignment. In addition, a scaled diversity regularizer enforces sensitivity of policy behavior to preference changes, preventing collapse. Across standard MORL benchmarks, including high-dimensional and many-objective control tasks, $D^3PO$ consistently discovers broader and higher-quality Pareto fronts than prior single- and multi-policy methods, matching or exceeding state-of-the-art hypervolume and expected utility while using a single deployable policy.
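To make the two mechanisms named above concrete, the sketch below illustrates (in PyTorch) one plausible reading of them: a preference-conditioned PPO surrogate that keeps per-objective advantages separate and scalarizes them only at the final loss step, plus a diversity term that penalizes the policy for producing the same behavior under different preference vectors. This is a minimal, hypothetical illustration, not the authors' $D^3PO$ implementation; all names (`PrefPolicy`, `ppo_loss_decomposed`, `diversity_bonus`) and design details are assumptions.

```python
# Illustrative sketch only: preference-conditioned PPO with late scalarization
# and a simple behavioral-diversity penalty. Not the authors' code.
import torch
import torch.nn as nn


class PrefPolicy(nn.Module):
    """Preference-conditioned Gaussian policy: pi(a | s, w)."""

    def __init__(self, obs_dim, act_dim, n_obj, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_obj, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs, pref):
        mean = self.net(torch.cat([obs, pref], dim=-1))
        return torch.distributions.Normal(mean, self.log_std.exp())


def ppo_loss_decomposed(policy, obs, act, pref, old_logp, adv_per_obj, clip=0.2):
    """Clipped PPO surrogate; per-objective advantages are scalarized only here."""
    dist = policy.dist(obs, pref)
    logp = dist.log_prob(act).sum(-1)
    ratio = torch.exp(logp - old_logp)
    # adv_per_obj has shape (batch, n_obj); the preference weights enter last.
    adv = (adv_per_obj * pref).sum(-1)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * adv
    return -torch.min(unclipped, clipped).mean()


def diversity_bonus(policy, obs, pref_a, pref_b):
    """Reward distinct mean actions under distinct preferences to avoid collapse."""
    da, db = policy.dist(obs, pref_a), policy.dist(obs, pref_b)
    return (da.mean - db.mean).pow(2).mean()
```

Under this reading, the total objective would subtract a scaled diversity term, e.g. `loss = ppo_loss_decomposed(...) - lambda_div * diversity_bonus(...)`, where `lambda_div` is a hypothetical coefficient standing in for the paper's scaled regularizer.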