In multi-objective reinforcement learning (MORL), agents are tasked with optimising decision-making behaviours that trade off between multiple, possibly conflicting, objectives. MORL based on decomposition is a family of solution methods that employs a set of utility functions to decompose the multi-objective problem into individual single-objective problems, which are solved simultaneously to approximate a Pareto front of policies. We focus on the case of linear utility functions parameterised by weight vectors w. We introduce a method based on the Upper Confidence Bound (UCB) to efficiently search for the most promising weight vectors during different stages of the learning process, with the aim of maximising the hypervolume of the resulting Pareto front. The proposed method is shown to outperform various MORL baselines on MuJoCo benchmark problems across different random seeds. The code is available at: https://github.com/SYCAMORE-1/ucb-MOPPO.
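The abstract describes UCB-driven selection of weight vectors, where each candidate weight is scored by its observed payoff plus an exploration bonus. As a minimal illustrative sketch (not the authors' implementation), a generic UCB1 bandit over a fixed grid of candidate weight vectors might look like this, with the reward standing in for the hypervolume improvement obtained after training under a chosen weight:

```python
import math
import random

def ucb_select(counts, values, c=2.0):
    """Return the index maximising the UCB1 score:
    empirical mean + exploration bonus. Untried arms are tried first."""
    total = sum(counts)
    best, best_score = 0, float("-inf")
    for i, (n, v) in enumerate(zip(counts, values)):
        if n == 0:
            return i  # ensure every candidate is sampled at least once
        score = v / n + math.sqrt(c * math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best

# Hypothetical candidate weight vectors on the 2-objective simplex.
weights = [(k / 10, 1 - k / 10) for k in range(11)]
counts = [0] * len(weights)
values = [0.0] * len(weights)

random.seed(0)
for step in range(100):
    i = ucb_select(counts, values)
    w = weights[i]
    # Placeholder reward: in the paper this would be the hypervolume
    # improvement of the Pareto front after training a policy under w.
    reward = random.random() * (1.0 - abs(w[0] - 0.5))
    counts[i] += 1
    values[i] += reward
```

Over many rounds the bandit concentrates its budget on weight vectors whose (placeholder) rewards are highest, while the bonus term keeps occasionally revisiting under-sampled candidates.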