In Offline Model Learning for Planning and in Offline Reinforcement Learning, the limited data set hinders the estimation of the value function of the corresponding Markov Decision Process (MDP). Consequently, the real-world performance of the resulting policy is limited and its deployment may be risky, especially when a wrong policy can lead to catastrophic consequences. For this reason, several lines of research aim to reduce the model error (or the distributional shift between the learned model and the true one) and, more broadly, to obtain solutions that are risk-aware with respect to model uncertainty. But when it comes to the final application, which baseline should a practitioner choose? For an offline context where computational time is not an issue and robustness is the priority, we propose Exploitation vs Caution (EvC), a paradigm that (1) elegantly incorporates model uncertainty by abiding by the Bayesian formalism, and (2) selects, among a fixed set of candidate policies provided, for instance, by the current baselines, the policy that maximizes a risk-aware objective over the Bayesian posterior. We validate EvC with state-of-the-art approaches in several discrete, yet simple, environments that offer a fair variety of MDP classes. In the tested scenarios, EvC manages to select robust policies and hence stands out as a useful tool for practitioners who aim to apply offline planning and reinforcement learning solvers in the real world.
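The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a tabular MDP with a known reward table, an independent Dirichlet posterior over each transition row, deterministic candidate policies, and Conditional Value at Risk (CVaR) as the risk-aware objective. The abstract fixes none of these choices, and all function and parameter names below are hypothetical.

```python
import numpy as np


def sample_transition_models(counts, n_models, rng, prior=1.0):
    """Sample transition tensors from a Dirichlet posterior (assumed model class).

    counts has shape (S, A, S): observed transitions s, a -> s'.
    Returns an array of shape (n_models, S, A, S).
    """
    n_states, n_actions, _ = counts.shape
    models = np.empty((n_models, n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            # One Dirichlet per (s, a) pair, with a symmetric prior added to the counts.
            models[:, s, a, :] = rng.dirichlet(counts[s, a] + prior, size=n_models)
    return models


def policy_return(P, R, policy, mu0, gamma):
    """Exact discounted return of a deterministic policy in one sampled MDP."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]  # (S, S) transition matrix induced by the policy
    r_pi = R[idx, policy]  # (S,) reward vector induced by the policy
    # Solve the Bellman equation V = r_pi + gamma * P_pi V.
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return float(mu0 @ V)


def cvar(returns, alpha):
    """Mean of the worst alpha-fraction of the sampled returns (assumed risk measure)."""
    k = max(1, int(np.ceil(alpha * len(returns))))
    return float(np.sort(returns)[:k].mean())


def select_policy(counts, R, candidates, mu0, gamma=0.95, alpha=0.1,
                  n_models=500, seed=0):
    """Return the index of the candidate with the highest CVaR over the posterior."""
    rng = np.random.default_rng(seed)
    models = sample_transition_models(counts, n_models, rng)
    scores = []
    for policy in candidates:
        returns = np.array([policy_return(P, R, policy, mu0, gamma) for P in models])
        scores.append(cvar(returns, alpha))
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    # Toy data (made up for illustration): 2 states, 2 actions.
    counts = np.array([[[50, 5], [1, 2]],
                       [[30, 30], [0, 1]]], dtype=float)
    R = np.array([[0.5, 1.0],
                  [0.2, 0.0]])
    candidates = [np.array([0, 0]), np.array([1, 1]), np.array([1, 0])]
    best, scores = select_policy(counts, R, candidates, mu0=np.array([1.0, 0.0]))
    print("selected candidate:", best, "CVaR scores:", scores)
```

In this sketch the candidate set is fixed in advance, so the cost grows linearly with the number of candidates and of posterior samples, which is consistent with the stated setting where computational time is not a constraint. Lowering `alpha` makes the selection more cautious, while `alpha = 1` reduces the objective to the posterior mean return.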