Decision-focused (DF) model-based reinforcement learning has recently been introduced as a powerful algorithm that focuses on learning the MDP dynamics most relevant for obtaining high returns. While this approach improves the agent's performance by directly optimizing the reward, it does so by learning dynamics that are less accurate from a maximum likelihood perspective. We demonstrate that when the reward function is defined by preferences over multiple objectives, the DF model may be sensitive to changes in those preferences. In this work, we develop the robust decision-focused (RDF) algorithm, which leverages the non-identifiability of DF solutions to learn models that maximize expected returns while also transferring to changes in the preference over multiple objectives. We demonstrate the effectiveness of RDF on two synthetic domains and two healthcare simulators, showing that it significantly improves the robustness of DF model learning to changes in the reward function without compromising training-time return.