Many advances that have improved the robustness and efficiency of deep reinforcement learning (RL) algorithms can, in one way or another, be understood as introducing additional objectives or constraints in the policy optimization step. These include ideas as far-ranging as exploration bonuses, entropy regularization, and regularization toward teachers or data priors. The task reward and the auxiliary objectives are often in conflict, and in this paper we argue that this makes it natural to treat these cases as instances of multi-objective (MO) optimization problems. We demonstrate how this perspective allows us to develop novel and more effective RL algorithms. In particular, we focus on offline RL and finetuning as case studies, and show that existing approaches can be understood as MO algorithms relying on linear scalarization. We hypothesize that replacing linear scalarization with a better algorithm can improve performance. We introduce Distillation of a Mixture of Experts (DiME), a new multi-objective RL (MORL) algorithm that outperforms linear scalarization and can be applied to these non-standard MO problems. We demonstrate that for offline RL, DiME leads to a simple new algorithm that outperforms the state of the art. For finetuning, we derive new algorithms that learn to outperform the teacher policy.
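To make the term concrete, linear scalarization collapses several objectives into a single scalar through a fixed weighted sum, and the weights alone decide which trade-off is selected. The sketch below is only an illustration of that idea, not the DiME algorithm and not any method from the paper: the candidate names, their objective values, and the helper functions are all hypothetical and chosen for readability.

```python
from typing import Dict, Sequence, Tuple


def linear_scalarization(objectives: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse several objective values into one scalar via a weighted sum."""
    if len(objectives) != len(weights):
        raise ValueError("need exactly one weight per objective")
    return sum(w * j for w, j in zip(weights, objectives))


# Hypothetical candidate policies, each scored on two objectives:
# (task return, closeness to the behavior data). The numbers are made up.
CANDIDATES: Dict[str, Tuple[float, float]] = {
    "stay_close_to_data": (0.2, 1.0),
    "balanced": (0.7, 0.7),
    "chase_reward": (1.0, 0.0),
}


def best_candidate(weights: Sequence[float]) -> str:
    """Return the candidate with the highest scalarized score for these weights."""
    return max(
        CANDIDATES,
        key=lambda name: linear_scalarization(CANDIDATES[name], weights),
    )


if __name__ == "__main__":
    # Different weightings pick different trade-offs between the objectives.
    for weights in [(1.0, 0.1), (1.0, 1.0), (0.1, 1.0)]:
        print(weights, "->", best_candidate(weights))
```

In this toy setting the outcome depends entirely on the chosen weights, which is the sensitivity that motivates looking at alternatives to a fixed weighted sum.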