Multi-objective reinforcement learning (MORL) aims to find a set of high-performing and diverse policies that address trade-offs between multiple conflicting objectives. In practice, however, decision makers (DMs) often deploy only one or a few trade-off policies. Presenting the DM with too many diverse trade-off policies not only significantly increases their workload but also introduces noise into multi-criterion decision-making. With this in mind, we propose a human-in-the-loop policy optimization framework for preference-based MORL that interactively identifies policies of interest. Our method proactively learns the DM's implicit preference information without requiring any a priori knowledge, which is often unavailable in real-world black-box decision scenarios. The learned preference information is then used to progressively guide policy optimization towards policies of interest. We evaluate our approach against three conventional MORL algorithms that do not consider preference information and four state-of-the-art preference-based MORL algorithms on two MORL environments for robot control and smart grid management. Experimental results demonstrate the effectiveness of our proposed method in comparison with these peer algorithms.