We consider estimation and inference using data collected by reinforcement learning algorithms. These algorithms experiment adaptively: they interact with individual units over multiple stages and adjust their strategies based on previous interactions. Our goal is to evaluate a counterfactual policy after data collection and to estimate structural parameters, such as dynamic treatment effects, which can be used for credit assignment and for determining the effect of earlier actions on final outcomes. Such parameters can be framed as solutions to moment equations, but not as minimizers of a population loss function, which leads to Z-estimation approaches in the case of static data. However, in the adaptive data collection environment of reinforcement learning, where algorithms deploy nonstationary behavior policies, standard estimators fail to achieve asymptotic normality because their variance fluctuates over time. We propose a weighted Z-estimation approach with carefully designed adaptive weights that stabilize the time-varying estimation variance. We identify proper weighting schemes that restore the consistency and asymptotic normality of the weighted Z-estimators for target parameters, which allows for hypothesis testing and the construction of uniform confidence regions. Primary applications include dynamic treatment effect estimation and dynamic off-policy evaluation.
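To make the idea of a weighted Z-estimator concrete, the following is a minimal sketch in a much simpler setting than the multi-stage one described above: a single-stage, two-armed bandit. Everything in it is an illustrative assumption rather than the paper's construction: the epsilon-greedy data collection, the helper names `simulate_bandit` and `weighted_z_estimate`, the moment function $\mathbb{1}\{A_t = a\}\,(Y_t - \theta)/e_t$ for the mean reward $\theta$ of arm $a$, and the weight $w_t = \sqrt{e_t}$, where $e_t$ is the logged probability of choosing arm $a$ at round $t$. That weight is one variance-stabilizing choice known from the adaptive-experiment literature: the conditional variance of the moment scales like $1/e_t$, so multiplying by $\sqrt{e_t}$ keeps the weighted variance roughly constant over time.

```python
import math
import random


def simulate_bandit(n_rounds, true_means, seed=0):
    """Collect data with an epsilon-greedy policy whose exploration rate decays.

    Hypothetical data-collection scheme, used only to produce a
    nonstationary behavior policy with known propensities.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    log = []  # one (action, reward, propensity vector) tuple per round
    for t in range(1, n_rounds + 1):
        eps = max(0.05, t ** -0.5)  # exploration decays, floored at 0.05
        means = [sums[k] / counts[k] if counts[k] else 0.0 for k in range(n_arms)]
        greedy = max(range(n_arms), key=lambda k: means[k])
        probs = [eps / n_arms + (1.0 - eps) * (k == greedy) for k in range(n_arms)]
        action = rng.choices(range(n_arms), weights=probs)[0]
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        sums[action] += reward
        log.append((action, reward, probs))
    return log


def weighted_z_estimate(log, arm, weight_fn):
    """Solve sum_t w_t * 1{A_t = arm} * (Y_t - theta) / e_t = 0 for theta.

    weight_fn maps the logged propensity e_t to the weight w_t.
    """
    num = 0.0
    den = 0.0
    for action, reward, probs in log:
        if action != arm:
            continue  # the indicator 1{A_t = arm} is zero
        e_t = probs[arm]
        w_t = weight_fn(e_t)
        num += w_t * reward / e_t
        den += w_t / e_t
    if den == 0.0:
        raise ValueError("arm was never selected; the moment equation has no solution")
    return num / den


if __name__ == "__main__":
    log = simulate_bandit(5000, [0.0, 0.5], seed=1)
    # Unweighted Z-estimator (w_t = 1) versus the stabilized one (w_t = sqrt(e_t)).
    print("uniform weights:   ", weighted_z_estimate(log, 0, lambda e: 1.0))
    print("sqrt(e_t) weights: ", weighted_z_estimate(log, 0, math.sqrt))
```

Because the moment equation is linear in $\theta$, the root has a closed form and no numerical solver is needed. Both weightings target the same parameter; the difference lies in how the variance of the weighted moment behaves as the behavior policy changes, which is what matters for asymptotic normality.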