We consider estimation and inference with data collected from episodic reinforcement learning (RL) algorithms, i.e., adaptive experimentation algorithms that, in each period (or episode), interact multiple times in a sequential manner with a single treated unit. Our goal is to evaluate counterfactual adaptive policies after data collection and to estimate structural parameters such as dynamic treatment effects, which can be used for credit assignment (e.g., what was the effect of the first-period action on the final outcome?). Such parameters of interest can be framed as solutions to moment equations, but not as minimizers of a population loss function, which leads to $Z$-estimation approaches in the case of static data. However, such estimators fail to be asymptotically normal when the data are collected adaptively. We propose a re-weighted $Z$-estimation approach with carefully designed adaptive weights to stabilize the episode-varying estimation variance, which results from the nonstationary policy that typical episodic RL algorithms invoke. We identify proper weighting schemes that restore the consistency and asymptotic normality of the re-weighted $Z$-estimators for the target parameters, which allows for hypothesis testing and the construction of uniform confidence regions. Primary applications include dynamic treatment effect estimation and dynamic off-policy evaluation.
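As a rough illustration of the contrast drawn above, the two estimating equations can be sketched as follows. The notation is assumed for this sketch and does not appear in the abstract: $T$ is the number of episodes, $Z_t$ the data from episode $t$, $m$ a moment function with $\mathbb{E}[m(Z_t;\theta_0) \mid H_{t-1}] = 0$ at the true parameter $\theta_0$, $H_{t-1}$ the history before episode $t$, and $W_t$ an adaptive weight. The specific form of $W_t$ is not stated in the abstract and is left unspecified here.

$$
\text{standard } Z\text{-estimator:}\quad \frac{1}{T}\sum_{t=1}^{T} m(Z_t;\hat\theta) = 0,
\qquad
\text{re-weighted } Z\text{-estimator:}\quad \sum_{t=1}^{T} W_t\, m(Z_t;\hat\theta_W) = 0,
\quad W_t \text{ determined by } H_{t-1}.
$$

Under these assumptions, each weighted term $W_t\, m(Z_t;\theta_0)$ still has conditional mean zero given $H_{t-1}$, so the weights can be chosen to even out the episode-varying variance without changing the parameter that the moment equation identifies.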