Off-policy evaluation (OPE) aims to estimate the benefit of following a counterfactual sequence of actions, given data collected from executed sequences. However, existing OPE estimators often exhibit high bias and high variance in problems involving large, combinatorial action spaces. We investigate how to mitigate this issue using factored action spaces, i.e., expressing each action as a combination of independent sub-actions from smaller action spaces. This approach facilitates a finer-grained analysis of how actions differ in their effects. In this work, we propose a new family of "decomposed" importance sampling (IS) estimators based on factored action spaces. Given certain assumptions on the underlying problem structure, we prove that the decomposed IS estimators have lower variance than their original non-decomposed versions, while remaining unbiased. Through simulations, we empirically verify our theoretical results and probe the validity of the various assumptions. Provided that a technique is available to derive the action space factorisation for a given problem, our work shows that OPE can be improved "for free" by utilising this inherent problem structure.
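To make the contrast between the standard and decomposed IS estimators concrete, the following is a minimal sketch on a hypothetical one-step toy problem, not the paper's experimental setup. It assumes that both policies factorise over sub-actions, that the reward is a sum of per-sub-action rewards, and that these sub-rewards are observed separately. The policy tables, reward means, and function names are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy problem (illustrative values, not from the paper):
# two binary sub-actions, one row per sub-action.
PI_B = np.array([[0.5, 0.5], [0.5, 0.5]])  # behaviour policy over each sub-action
PI_E = np.array([[0.9, 0.1], [0.8, 0.2]])  # target (evaluation) policy
MU = np.array([[1.0, 0.0], [1.0, 0.0]])    # assumed mean sub-reward per choice


def standard_is_terms(rho_sub, r_sub):
    """Per-sample standard IS terms: the product of all sub-action
    ratios weights the total reward."""
    return rho_sub.prod(axis=1) * r_sub.sum(axis=1)


def decomposed_is_terms(rho_sub, r_sub):
    """Per-sample decomposed IS terms: each sub-reward is weighted only
    by the ratio of its own sub-action (assumes an additive reward)."""
    return (rho_sub * r_sub).sum(axis=1)


def run_toy_comparison(n=20000, seed=0):
    rng = np.random.default_rng(seed)
    num_sub, num_choices = PI_B.shape
    d = np.arange(num_sub)

    # Sample each sub-action independently from the behaviour policy.
    a = np.stack(
        [rng.choice(num_choices, size=n, p=PI_B[j]) for j in range(num_sub)],
        axis=1,
    )
    # Noisy sub-rewards; the total reward is their sum.
    r_sub = MU[d, a] + 0.1 * rng.standard_normal((n, num_sub))
    # Per-sub-action importance ratios pi_e / pi_b, shape (n, num_sub).
    rho_sub = PI_E[d, a] / PI_B[d, a]

    true_value = float((PI_E * MU).sum())
    return true_value, standard_is_terms(rho_sub, r_sub), decomposed_is_terms(rho_sub, r_sub)


if __name__ == "__main__":
    true_value, std_terms, dec_terms = run_toy_comparison()
    print("true value:      ", true_value)
    print("standard IS:     ", std_terms.mean(), "variance", std_terms.var())
    print("decomposed IS:   ", dec_terms.mean(), "variance", dec_terms.var())
```

In this toy setting both estimators target the same value, while the per-sample terms of the decomposed estimator vary less, because each sub-reward is no longer multiplied by the ratios of unrelated sub-actions. Whether this carries over to a real problem depends on the structural assumptions holding there.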