Decision-focused (DF) model-based reinforcement learning has recently been introduced as a powerful algorithm that focuses on learning the Markov decision process (MDP) dynamics most relevant for obtaining high rewards. While this approach improves agent performance by directing learning toward optimizing the reward directly, it does so by learning less accurate dynamics (from a maximum likelihood estimation (MLE) standpoint), and may thus be brittle to changes in the reward function. In this work, we develop the robust decision-focused (RDF) algorithm, which leverages the non-identifiability of DF solutions to learn models that maximize expected returns while also being robust to changes in the reward function. We demonstrate on a variety of toy examples and healthcare simulators that RDF significantly increases the robustness of DF to changes in the reward function, without decreasing the overall return the agent obtains.