Policy evaluation problems in reinforcement learning are usually modeled as finite-horizon MDPs or as infinite-horizon MDPs with a discounted or average-reward criterion. In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given a dataset of i.i.d. episodes truncated at a given level, we propose an algorithm, called MWLA, that estimates the expected return directly via the importance ratio of the state-action occupancy measure. We derive a Mean Square Error (MSE) bound for the MWLA method and analyze how the statistical error depends on the data size and the truncation level. Computational experiments in an episodic taxi environment illustrate the performance of the MWLA algorithm.
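To make the role of the importance ratio concrete, the following is a minimal sketch of the final plug-in step only, assuming a ratio function is already available. It is not the MWLA algorithm itself, which is about obtaining that ratio from data. The names `ratio_weighted_return`, `episodes`, `ratio`, and `horizon` are hypothetical and do not come from the paper.

```python
from typing import Callable, Hashable, Sequence, Tuple

# One transition recorded under the behavior policy: (state, action, reward).
Step = Tuple[Hashable, Hashable, float]


def ratio_weighted_return(
    episodes: Sequence[Sequence[Step]],
    ratio: Callable[[Hashable, Hashable], float],
    horizon: int,
) -> float:
    """Average, over episodes, of the ratio-weighted undiscounted reward sum.

    Assumption: `ratio(s, a)` approximates the state-action occupancy ratio
    of the target policy to the behavior policy. Each episode is cut off
    after at most `horizon` steps (the truncation level).
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for episode in episodes:
        # Only the first `horizon` steps of each episode are used.
        for state, action, reward in episode[:horizon]:
            total += ratio(state, action) * reward
    return total / len(episodes)


# Toy data: two short episodes and a hand-picked ratio, for illustration only.
toy_episodes = [
    [("s0", "a", 1.0), ("s1", "b", 2.0)],
    [("s0", "b", 3.0)],
]
toy_ratio = lambda s, a: 2.0 if a == "a" else 0.5
estimate = ratio_weighted_return(toy_episodes, toy_ratio, horizon=2)
```

With a ratio that is identically 1, the function reduces to the plain average of truncated episode returns, which is a convenient sanity check.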