We study the problem of conservative off-policy evaluation (COPE): given an offline dataset of environment interactions collected by other agents, we seek a tight lower bound on a policy's performance. Such a bound is crucial when deciding whether a policy satisfies minimal performance or safety criteria before it is deployed in the real world. To this end, we introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics. To form a conservative estimate of the policy's performance, HAMBO hallucinates worst-case trajectories that the policy may take, within the margin of the model's epistemic confidence regions. We prove that the resulting COPE estimates are valid lower bounds and, under regularity conditions, show that they converge to the true expected return. Finally, we discuss scalable variants of our approach based on Bayesian Neural Networks and empirically demonstrate that they yield reliable and tight lower bounds in various continuous control environments.
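The idea of hallucinating pessimistic trajectories inside a model's confidence region can be illustrated with a minimal sketch. It is not the HAMBO implementation: the scalar state, the linear policy, the quadratic reward, the three-member linear ensemble standing in for a Bayesian Neural Network, and the names `predict`, `rollout`, `beta`, and `etas` are all assumptions made for illustration. The sketch also picks the worst next state greedily at each step, which is a simplification of a worst case taken over whole trajectories and is not guaranteed to find the true minimum in general.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): each ensemble member is a
# linear model s' = a_i * s + b_i * action, standing in for a BNN sample.
ENSEMBLE_A = np.array([0.95, 1.00, 1.05])  # per-member state coefficients
ENSEMBLE_B = np.array([0.8, 1.0, 1.2])     # per-member action coefficients


def policy(state):
    """Assumed linear feedback policy that pushes the state toward zero."""
    return -0.5 * state


def reward(state):
    """Assumed reward: staying close to zero is good."""
    return -state ** 2


def predict(state, action):
    """Return the ensemble mean and standard deviation of the next state."""
    preds = ENSEMBLE_A * state + ENSEMBLE_B * action
    return preds.mean(), preds.std()


def rollout(s0, horizon, beta, etas):
    """Roll out the policy under the learned model.

    At every step the next state is chosen among mean + beta * std * eta,
    for eta in `etas`, as the candidate with the lowest reward. With
    beta = 0 this reduces to a rollout of the mean model.
    """
    state, total = s0, 0.0
    for _ in range(horizon):
        mean, std = predict(state, policy(state))
        candidates = mean + beta * std * etas
        state = candidates[np.argmin(reward(candidates))]
        total += reward(state)
    return total


# Mean-model estimate versus the greedy pessimistic estimate.
nominal = rollout(1.0, horizon=20, beta=0.0, etas=np.array([0.0]))
pessimistic = rollout(1.0, horizon=20, beta=2.0,
                      etas=np.linspace(-1.0, 1.0, 21))
print(f"nominal return:     {nominal:.4f}")
print(f"pessimistic return: {pessimistic:.4f}")
```

In this toy system the pessimistic return is never higher than the mean-model return, and the gap between the two shrinks as the ensemble members agree more closely, which mirrors the role that epistemic uncertainty plays in the tightness of the bound.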