In multi-objective reinforcement learning (MORL), agents are tasked with optimising decision-making behaviours that trade off between multiple, possibly conflicting, objectives. Decomposition-based MORL is a family of solution methods that use a set of utility functions to decompose the multi-objective problem into single-objective subproblems, which are solved simultaneously to approximate a Pareto front of policies. We focus on the case of linear utility functions parameterised by weight vectors w. We introduce a method based on the Upper Confidence Bound (UCB) that efficiently searches for the most promising weight vectors at different stages of the learning process, with the aim of maximising the hypervolume of the resulting Pareto front. The proposed method is shown to outperform various MORL baselines on MuJoCo benchmark problems across different random seeds. The code is available online at: https://github.com/SYCAMORE-1/ucb-MOPPO.
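To make the ingredients named in the abstract concrete, the following is a minimal sketch of linear scalarisation, a two-objective hypervolume computation, and a generic UCB1-style rule for choosing among candidate weight vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names (`linear_utility`, `hypervolume_2d`, `ucb_select`), the exploration constant `c`, and the idea of scoring each weight vector by its mean observed hypervolume gain are all hypothetical choices made here for exposition.

```python
import math


def linear_utility(returns, w):
    """Scalarise a vector of per-objective returns with weight vector w."""
    return sum(r * wi for r, wi in zip(returns, w))


def hypervolume_2d(points, ref):
    """Area dominated by 2-objective points relative to ref (maximisation)."""
    # Keep only points that strictly dominate the reference point,
    # then sweep from the largest first objective to the smallest.
    pts = sorted(
        (p for p in points if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: p[0],
        reverse=True,
    )
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:  # point adds a new horizontal strip to the area
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def ucb_select(mean_gain, counts, t, c=1.0):
    """Pick the index of the weight vector with the highest UCB1-style score.

    mean_gain[i]: assumed running mean of hypervolume gain for candidate i.
    counts[i]:    number of times candidate i has been selected so far.
    t:            total number of selections made so far (t >= 1).
    """
    # Try every candidate once before applying the confidence bound.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [
        m + c * math.sqrt(2.0 * math.log(t) / n)
        for m, n in zip(mean_gain, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch, a rarely tried weight vector receives a large exploration bonus, so it can be selected even when its mean gain is lower; how the actual method balances exploration and exploitation is described in the paper and the linked repository.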