When data are collected adaptively, such as in bandit algorithms, classical statistical approaches such as ordinary least squares and $M$-estimation often fail to achieve asymptotic normality. Although recent lines of work have modified the classical approaches to ensure valid inference on adaptively collected data, most of these works assume that the model is correctly specified. The misspecified setting poses unique challenges because the parameter of interest itself may not be well-defined over a non-stationary distribution of rewards. We therefore tackle the problem of \emph{off-policy} inference in adaptive settings, where we uniquely define a projected solution under a stationary evaluation policy. Our method provides valid inference for $M$-estimators that use adaptively collected bandit data with a possibly misspecified working model. A key ingredient in our approach is the use of flexible techniques to stabilize the variance induced by adaptive data collection. A major novelty is that the procedure enables the construction of valid confidence sets even in settings where treatment policies are unstable and non-converging, such as when there is no unique optimal arm and standard bandit algorithms are used. Empirical results on semi-synthetic datasets constructed from the Osteoarthritis Initiative demonstrate that our method maintains type I error control, while existing methods for inference in adaptive settings fail to achieve nominal coverage in the misspecified case.
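To make the projected target concrete, one standard way to formalize it (the notation here is illustrative, not taken from the paper) is as the population minimizer of the working-model loss under the fixed evaluation policy:
\[
\theta^{\star} \;=\; \arg\min_{\theta \in \Theta}\; \mathbb{E}_{X \sim \mathcal{P},\; A \sim \pi^{e}(\cdot \mid X),\; Y \sim P_{Y}(\cdot \mid X, A)}\big[\, \ell_{\theta}(X, A, Y) \,\big],
\]
where $\pi^{e}$ denotes the stationary evaluation policy and $\ell_{\theta}$ the working-model loss. Because $\pi^{e}$ does not depend on the adaptively collected data, $\theta^{\star}$ remains well-defined even when the working model is misspecified, in contrast to an on-policy target that shifts with the non-stationary action distribution.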