This work studies off-policy evaluation (OPE) in scenarios where two key reinforcement learning (RL) assumptions -- temporal stationarity and individual homogeneity -- are both violated. To handle these ``double inhomogeneities'', we propose a class of latent factor models for the reward and observation transition functions, under which we develop a general OPE framework that consists of both model-based and model-free approaches. To our knowledge, this is the first paper that develops statistically sound OPE methods in offline RL with double inhomogeneities. It contributes to a deeper understanding of OPE in environments where standard RL assumptions are not met and provides several practical approaches for these settings. We establish the theoretical properties of the proposed value estimators and empirically show that our approach outperforms competing methods that ignore either temporal nonstationarity or individual heterogeneity. Finally, we illustrate our method on a data set from the Medical Information Mart for Intensive Care.