A common approach to learning mobile health (mHealth) intervention policies is linear Thompson sampling. Two desirable mHealth policy features are (1) pooling information across individuals and time and (2) incorporating a time-varying baseline reward. Previous approaches pooled information across individuals but not time, failing to capture trends in treatment effects over time. In addition, these approaches did not explicitly model the baseline reward, which limited the ability to precisely estimate the parameters in the differential reward model. In this paper, we propose a novel Thompson sampling algorithm, termed "DML-TS-NNR," that leverages (1) nearest-neighbors to efficiently pool information on the differential reward function across users and time and (2) the Double Machine Learning (DML) framework to explicitly model baseline rewards and stay agnostic to the supervised learning algorithms used. By explicitly modeling baseline rewards, we obtain smaller confidence sets for the differential reward parameters. We offer theoretical guarantees on the pseudo-regret, which are supported by empirical results. Importantly, the DML-TS-NNR algorithm demonstrates robustness to potential misspecifications in the baseline reward model.
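To make the separation of baseline and differential reward concrete, the following is a minimal sketch of two-action linear Thompson sampling with a centered action and a subtracted baseline estimate. It is not the DML-TS-NNR algorithm: it omits the nearest-neighbor pooling across users and time and the cross-fitted supervised learners of the DML framework, and it uses a simple action-centering estimator as a stand-in. The simulated environment, the parameter values, and all function names (`run_centered_ts`, `baseline_true`, `baseline_hat`) are assumptions made for illustration.

```python
import math
import numpy as np


def normal_cdf(u):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def run_centered_ts(n_steps=5000, pi_min=0.1, lam=1.0, v=1.0, seed=0):
    """Simulate a two-action bandit: reward = baseline(s) + a * x^T theta + noise."""
    rng = np.random.default_rng(seed)
    theta_true = np.array([0.5, -0.3, 0.2])  # hypothetical differential-reward parameters
    dim = theta_true.size

    def baseline_true(s):
        # Hypothetical nonlinear baseline reward, unknown to the learner.
        return 1.0 + np.sin(s[0])

    def baseline_hat(s):
        # Deliberately crude (misspecified) baseline estimate; in practice
        # this could come from any supervised learner.
        return 1.0

    B = lam * np.eye(dim)  # regularized Gram matrix of centered features
    b = np.zeros(dim)      # centered-feature-weighted residual rewards
    probs = []

    for _ in range(n_steps):
        s = rng.normal(size=dim - 1)
        x = np.concatenate(([1.0], s))  # intercept plus state features

        B_inv = np.linalg.inv(B)
        theta_hat = B_inv @ b
        # Under a N(theta_hat, v^2 B^{-1}) posterior, this is the probability
        # that treatment has a positive effect. Randomizing with this
        # probability is Thompson sampling for two actions; clipping keeps
        # both actions explored.
        sd = v * math.sqrt(x @ B_inv @ x)
        pi = float(np.clip(normal_cdf((x @ theta_hat) / sd), pi_min, 1.0 - pi_min))
        a = int(rng.random() < pi)

        r = baseline_true(s) + a * (x @ theta_true) + rng.normal(scale=0.5)

        # Centering the action makes the regressor mean-zero given the state,
        # so errors in baseline_hat add variance but no bias.
        z = (a - pi) * x
        B += np.outer(z, z)
        b += z * (r - baseline_hat(s))
        probs.append(pi)

    theta_hat = np.linalg.solve(B, b)
    return theta_hat, theta_true, np.array(probs)


if __name__ == "__main__":
    est, truth, _ = run_centered_ts()
    print("estimate:", np.round(est, 3), "truth:", truth)
```

In this sketch, a more accurate `baseline_hat` shrinks the residual `r - baseline_hat(s)` and therefore the variance of the estimate, which mirrors the abstract's point that modeling the baseline yields tighter confidence sets, while the centering step is what keeps the estimate consistent when the baseline model is wrong.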