We study the problem of conservative off-policy evaluation (COPE): given an offline dataset of environment interactions collected by other agents, we seek a (tight) lower bound on a policy's performance. Such a bound is crucial when deciding whether a given policy satisfies minimal performance or safety criteria before it is deployed in the real world. To this end, we introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics. To form a conservative estimate of the policy's performance, HAMBO hallucinates worst-case trajectories that the policy may take, within the margin of the model's epistemic confidence regions. We prove that the resulting COPE estimates are valid lower bounds and, under regularity conditions, show that they converge to the true expected return. Finally, we discuss scalable variants of our approach based on Bayesian Neural Networks and empirically demonstrate that they yield reliable and tight lower bounds in various continuous control environments.
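To make the idea of hallucinating worst-case trajectories inside a confidence region concrete, the following is a minimal toy sketch and not the HAMBO algorithm itself. Everything in it is an assumption made for illustration: the scalar dynamics in `model_mean`, the epistemic standard deviation in `model_std`, the linear `policy`, the quadratic `reward`, the confidence scaling `beta`, and the coarse grid of perturbations `eta_grid`. The adversary picks, at each step, a point in the interval mean ± beta · std, and the pessimistic return is the minimum over all perturbation sequences on the grid.

```python
import itertools

def model_mean(s, u):
    # Hypothetical learned mean of the next state (toy linear dynamics).
    return 0.9 * s + 0.5 * u

def model_std(s, u):
    # Hypothetical epistemic std; assumed to grow away from the origin.
    return 0.05 + 0.1 * abs(s)

def policy(s):
    # Hypothetical policy under evaluation: a simple linear controller.
    return -0.4 * s

def reward(s, u):
    # Hypothetical reward: stay near the origin with small actions.
    return -(s ** 2) - 0.1 * (u ** 2)

def rollout_return(s0, etas, beta=2.0, gamma=0.95):
    # Discounted return of one hallucinated trajectory. Each eta in [-1, 1]
    # selects a next state inside the interval mean +/- beta * std.
    s, total = s0, 0.0
    for t, eta in enumerate(etas):
        u = policy(s)
        total += gamma ** t * reward(s, u)
        s = model_mean(s, u) + beta * model_std(s, u) * eta
    return total

def pessimistic_return(s0, horizon=5, eta_grid=(-1.0, 0.0, 1.0),
                       beta=2.0, gamma=0.95):
    # Worst case over every eta sequence on the grid (exhaustive search,
    # only feasible for tiny horizons and grids).
    return min(
        rollout_return(s0, etas, beta, gamma)
        for etas in itertools.product(eta_grid, repeat=horizon)
    )

nominal = rollout_return(1.0, [0.0] * 5)   # follow the model mean
lower = pessimistic_return(1.0)            # adversarial choice of etas
print(f"nominal={nominal:.3f}, pessimistic={lower:.3f}")
```

Because the all-zeros sequence is part of the search, the pessimistic value can never exceed the nominal model-mean return. The exhaustive search over a discretized interval is only a stand-in: it approximates the minimum over the continuous confidence region and scales exponentially in the horizon, so a practical method would need a different optimizer for the adversary.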