Off-policy evaluation (OPE) is a fundamental task in reinforcement learning (RL). In the classic setting of linear OPE, finite-sample guarantees often take the form $$\text{Evaluation error} \le \mathrm{poly}\bigl(C^\pi, d, 1/n, \log(1/\delta)\bigr),$$ where $d$ is the feature dimension, $n$ is the number of samples, $\delta$ is the failure probability, and $C^\pi$ is a coverage parameter that characterizes the degree to which the visited features lie in the span of the data distribution. While such guarantees are well understood for several popular algorithms under stronger assumptions (e.g., Bellman completeness), our understanding remains lacking and fragmented in the minimal setting where only the target value function is linearly realizable in the features. Despite recent interest in tightly characterizing the statistical rate in this setting, the right notion of coverage remains unclear: candidate definitions from prior analyses have undesirable properties and are starkly disconnected from the more standard definitions in the literature. We provide a novel finite-sample analysis of a canonical algorithm for this setting, LSTDQ. Inspired by an instrumental-variable view, we develop error bounds that depend on a novel coverage parameter, the feature-dynamics coverage, which can be interpreted as linear coverage in an induced dynamical system for feature evolution. Under further assumptions, such as Bellman completeness, our definition recovers the coverage parameters specialized to those settings, finally yielding a unified understanding of coverage in linear OPE.
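For concreteness, here is a minimal NumPy sketch of the standard LSTDQ estimator referenced above. The function name `lstdq`, the ridge term `reg`, and the use of `np.linalg.solve` are illustrative choices, not details from the paper; the sketch only shows the textbook form of the estimator whose finite-sample behavior the paper analyzes.

```python
import numpy as np

def lstdq(phi_sa, phi_next, rewards, gamma, reg=1e-8):
    """Textbook LSTDQ estimator (illustrative sketch).

    phi_sa:   (n, d) array of features phi(s_i, a_i) of logged state-action pairs
    phi_next: (n, d) array of successor features phi(s'_i, pi(s'_i))
              under the target policy pi
    rewards:  (n,) array of observed rewards r_i
    gamma:    discount factor in [0, 1)
    reg:      small ridge term for numerical stability (illustrative, not
              part of the estimator as usually stated)
    """
    n, d = phi_sa.shape
    # A = (1/n) * sum_i phi_i (phi_i - gamma * phi'_i)^T
    A = phi_sa.T @ (phi_sa - gamma * phi_next) / n
    # b = (1/n) * sum_i phi_i r_i
    b = phi_sa.T @ rewards / n
    # theta_hat solves A theta = b; the estimate is Q_hat(s, a) = phi(s, a)^T theta_hat
    return np.linalg.solve(A + reg * np.eye(d), b)
```

In the instrumental-variable reading the abstract alludes to, $\phi(s_i, a_i)$ acts as the instrument and $A$ as the empirical moment matrix; how well-conditioned this system is, is precisely what a coverage parameter must control.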