Online reinforcement learning and other adaptive sampling algorithms are increasingly used in digital intervention experiments to optimize treatment delivery for users over time. In this work, we focus on longitudinal user data collected by a large class of adaptive sampling algorithms that are designed to optimize treatment decisions online using accruing data from multiple users. Combining or "pooling" data across users allows adaptive sampling algorithms to potentially learn faster. However, by pooling, these algorithms induce dependence between the collected user data trajectories; we show that this can cause standard variance estimators for i.i.d. data to underestimate the true variance of common estimators on this data type. We develop novel methods to perform a variety of statistical analyses on such adaptively collected data via Z-estimation. Specifically, we introduce the adaptive sandwich variance estimator, a corrected sandwich estimator that leads to consistent variance estimates under adaptive sampling. Additionally, to prove our results we develop significant theory for empirical processes on non-i.i.d., adaptively collected, longitudinal data. This work is motivated by our efforts in designing experiments in which online reinforcement learning algorithms pool data across users to learn to optimize treatment decisions, while reliable statistical inference remains essential for the analyses conducted after the experiment is over.
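For readers unfamiliar with the terminology, the following is a minimal sketch of the standard i.i.d. sandwich variance estimator for a scalar Z-estimator, that is, the baseline that the abstract says can underestimate the true variance when data are pooled across users. It is not the adaptive sandwich variance estimator, whose correction is not specified in the abstract. The estimating function (least squares through the origin), the function names `fit_slope` and `iid_sandwich_variance`, and the simulated data are all illustrative assumptions.

```python
import random


def fit_slope(xs, ys):
    """Z-estimator: solve sum_i x_i * (y_i - x_i * theta) = 0 for theta."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def iid_sandwich_variance(xs, ys, theta):
    """Standard sandwich variance estimate for theta, treating units as i.i.d.

    This is the uncorrected baseline; it ignores any dependence between units.
    """
    n = len(xs)
    # "Bread": average negative derivative of the estimating function in theta.
    bread = sum(x * x for x in xs) / n
    # "Meat": average squared estimating function evaluated at the fitted theta.
    meat = sum((x * (y - x * theta)) ** 2 for x, y in zip(xs, ys)) / n
    # Scalar sandwich formula: bread^{-1} * meat * bread^{-1}, divided by n.
    return meat / (bread ** 2) / n


if __name__ == "__main__":
    # Hypothetical i.i.d. data with heteroskedastic noise (an assumption for
    # illustration; adaptively collected data would violate the i.i.d. premise).
    rng = random.Random(0)
    xs = [rng.uniform(0.5, 2.0) for _ in range(500)]
    ys = [1.5 * x + rng.gauss(0.0, x) for x in xs]
    theta_hat = fit_slope(xs, ys)
    variance = iid_sandwich_variance(xs, ys, theta_hat)
    print(f"theta_hat={theta_hat:.3f}, sandwich s.e.={variance ** 0.5:.3f}")
```

The sketch treats every unit's contribution to the estimating equation as independent. Under the pooled adaptive sampling described in the abstract, that assumption fails, which is the gap the adaptive sandwich variance estimator is intended to close.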