In off-policy reinforcement learning, a behaviour policy performs exploratory interactions with the environment to obtain state-action-reward samples, which are then used to learn a target policy that optimises the expected return. This leads to the problem of off-policy evaluation, where one needs to evaluate the target policy from samples collected by the often unrelated behaviour policy. Importance sampling is a traditional statistical technique that is often applied to off-policy evaluation. While importance sampling estimators are unbiased, their variance increases exponentially with the horizon of the decision process because the importance weight is computed as a product of action probability ratios, yielding estimates with low accuracy for domains involving long-term planning. This paper proposes state-based importance sampling, which drops the action probability ratios of sub-trajectories with ``negligible states'' -- roughly speaking, those for which the chosen actions have no impact on the return estimate -- from the computation of the importance weight. Theoretical results show that this reduces the variance of ordinary importance sampling from $O(\exp(H))$ to $O(\exp(X))$, where $H$ is the horizon and $X < H$ is the length of the largest sub-trajectory with non-negligible states. To identify negligible states, two search algorithms are proposed, one based on covariance testing and one based on state-action values. We formulate state-based variants of ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation. Experiments in four distinct domains show that state-based methods consistently yield reduced variance and improved accuracy compared to their traditional counterparts.
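To make the idea of dropping ratios concrete, the following is a minimal sketch and not the paper's implementation. It assumes tabular policies stored as `policy[state][action]` probabilities, trajectories given as lists of `(state, action, reward)` tuples, and a set of negligible states supplied by the caller; the two search algorithms that would identify that set are not reproduced here. The function names `importance_weight` and `is_estimate` are hypothetical, and skipping the ratio at every visit to a negligible state is a simplifying assumption about how the sub-trajectories are handled.

```python
def importance_weight(trajectory, pi_target, pi_behaviour, negligible=frozenset()):
    """Product of action probability ratios along one trajectory.

    With an empty `negligible` set this is the ordinary importance weight.
    Ratios at states listed in `negligible` are skipped (assumed simplification).
    """
    weight = 1.0
    for state, action, _reward in trajectory:
        if state in negligible:
            continue  # state-based variant: this ratio does not enter the product
        weight *= pi_target[state][action] / pi_behaviour[state][action]
    return weight


def is_estimate(trajectories, pi_target, pi_behaviour, negligible=frozenset(), gamma=1.0):
    """Average of importance-weighted discounted returns over all trajectories."""
    total = 0.0
    for trajectory in trajectories:
        # Discounted return of the trajectory under the observed rewards.
        ret = sum(gamma ** t * reward for t, (_s, _a, reward) in enumerate(trajectory))
        total += importance_weight(trajectory, pi_target, pi_behaviour, negligible) * ret
    return total / len(trajectories)
```

Calling `is_estimate(trajectories, pi_target, pi_behaviour)` gives the ordinary estimate, while passing `negligible={...}` shortens the product of ratios, which is the mechanism behind the variance reduction described above.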