Online reinforcement learning and other adaptive sampling algorithms are increasingly used in digital intervention experiments to optimize treatment delivery for users over time. In this work, we focus on longitudinal user data collected by a large class of adaptive sampling algorithms that are designed to optimize treatment decisions online using accruing data from multiple users. Combining or "pooling" data across users can allow adaptive sampling algorithms to learn faster. However, by pooling, these algorithms induce dependence between the collected user data trajectories; we show that this dependence can cause standard variance estimators for i.i.d. data to underestimate the true variance of common estimators applied to data of this type. We develop novel methods to perform a variety of statistical analyses on such adaptively collected data via Z-estimation. Specifically, we introduce the adaptive sandwich variance estimator, a corrected sandwich estimator that leads to consistent variance estimates under adaptive sampling. To prove our results, we also develop novel theoretical tools for empirical processes on adaptively collected longitudinal data, which may be of independent interest. This work is motivated by our efforts to design experiments in which online reinforcement learning algorithms optimize treatment decisions while statistical inference remains essential for the analyses conducted after the experiment concludes.
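For readers unfamiliar with the baseline being corrected, the following is a minimal sketch of the standard sandwich variance estimator for a Z-estimator under i.i.d. sampling, using least squares as the estimating equation. It is not the adaptive sandwich variance estimator introduced in this work, whose correction terms are not given in the abstract; it only illustrates the uncorrected form that, under pooled adaptive sampling, can underestimate the true variance. The function name `sandwich_variance`, the simulated data, and the choice of least squares are all assumptions made for illustration.

```python
import numpy as np


def sandwich_variance(X, y):
    """Standard (i.i.d.) sandwich variance for the least-squares Z-estimator.

    The estimator solves sum_i psi_i(theta) = 0 with
    psi_i(theta) = x_i * (y_i - x_i @ theta).
    This is the uncorrected baseline, not the adaptive version.
    """
    n = X.shape[0]
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    resid = y - X @ theta_hat
    psi = X * resid[:, None]       # (n, d) estimating-function values
    bread = -(X.T @ X) / n         # average derivative of psi w.r.t. theta
    meat = (psi.T @ psi) / n       # average outer product of psi

    bread_inv = np.linalg.inv(bread)
    cov = bread_inv @ meat @ bread_inv.T / n
    return theta_hat, cov


# Hypothetical simulated data: i.i.d. rows, so the standard form is appropriate.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([1.0, -2.0])
y = X @ theta_true + rng.normal(size=n)

theta_hat, cov = sandwich_variance(X, y)
std_err = np.sqrt(np.diag(cov))
```

The "bread" and "meat" here are both simple per-user averages, which is what the i.i.d. assumption justifies. When an algorithm pools data across users to choose treatments, the trajectories are no longer independent, and this is the setting in which the abstract states that a corrected estimator is needed.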