Multi-objective Markov decision processes are a class of multi-objective optimization problems that involve sequential decision making and satisfy the Markov property of stochastic processes. Multi-objective reinforcement learning methods address these problems by combining the reinforcement learning paradigm with multi-objective optimization techniques. A major drawback of these methods is their lack of adaptability to non-stationary dynamics in the environment, because they rely on optimization procedures that assume stationarity when evolving a coverage set of policies that can solve the problem. This paper introduces a developmental optimization approach that evolves the policy coverage set online while exploring the preference space over the defined objectives. Based on this approach, we propose a novel multi-objective reinforcement learning algorithm that robustly evolves a convex coverage set of policies online in non-stationary environments. We compare the proposed algorithm with two state-of-the-art multi-objective reinforcement learning algorithms in both stationary and non-stationary environments. The results show that the proposed algorithm significantly outperforms the existing algorithms in non-stationary environments while achieving comparable results in stationary environments.
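To make the notion of a convex coverage set concrete, the following minimal sketch selects, from a fixed list of candidate policies, those that are optimal for at least one linear preference (weight) vector over two objectives. It is only an illustration of the concept under stated assumptions, not the algorithm proposed in the paper: the value vectors in `candidate_values`, the function names, and the uniform weight sweep are all hypothetical, and the sweep is a static approximation that does not address online learning or non-stationarity.

```python
from typing import List, Sequence, Tuple


def scalarize(value: Sequence[float], weights: Sequence[float]) -> float:
    """Linear scalarization: weighted sum of the per-objective returns."""
    return sum(v * w for v, w in zip(value, weights))


def approx_convex_coverage_set(
    values: List[Tuple[float, float]], n_weights: int = 101
) -> List[int]:
    """Return indices of policies that are best for at least one sampled weight.

    Assumes exactly two objectives, so a preference is (w, 1 - w) with w in [0, 1].
    The result approximates the convex coverage set on the sampled weight grid.
    """
    members = set()
    for i in range(n_weights):
        w = i / (n_weights - 1)
        weights = (w, 1.0 - w)
        best = max(range(len(values)), key=lambda k: scalarize(values[k], weights))
        members.add(best)
    return sorted(members)


# Hypothetical expected returns (objective 1, objective 2) of five policies.
candidate_values = [(1.0, 9.0), (5.0, 7.0), (4.0, 4.0), (9.0, 1.0), (6.0, 5.0)]

ccs_indices = approx_convex_coverage_set(candidate_values)
print(ccs_indices)
```

In this toy data, the policy with returns (6.0, 5.0) is not dominated by any other policy, yet no linear preference makes it the best choice, so it is excluded from the convex coverage set; the policy with returns (4.0, 4.0) is excluded because it is dominated outright.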