We develop an anytime-valid framework for optimal policy identification from logged contextual bandit data. In many applied settings, the analyst wants to select the optimal policy from a candidate policy class $\Pi$, but data are generated by an externally determined logging policy that they do not control. The analyst may also wish to monitor evidence continuously, stopping once the optimal policy is clear rather than committing to a fixed sample size in advance. This paper addresses these challenges by constructing a time-indexed set $S_t$ that retains the true optimal policy set uniformly over time with high probability. The resulting procedure allows the analyst to monitor policy values, eliminate clearly suboptimal policies, and stop at data-dependent times without invalidating inference. When the optimal policy is unique, we define a stopping time for its identification and derive a sample-complexity bound scaling as $O\!\left(\frac{\log |\Pi|+\log\log(1/\Delta_{\min})}{\Delta_{\min}^2}\right)$, where $\Delta_{\min}$ is the gap between the best and second-best policy values. Simulations demonstrate that the anytime-valid approach can yield substantial sample savings relative to fixed-$N$ designs. An application to a large adaptive experiment on reducing misinformation online illustrates how the method provides a dynamic view as evidence on the optimal policy accumulates.
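The monitoring idea described above (track policy values, drop clearly suboptimal policies, stop when one policy remains) can be illustrated with a minimal sketch. This is not the paper's construction. It assumes a finite policy class, i.i.d. logged data, rewards in $[0,1]$, and known logging propensities bounded below by a hypothetical constant `p_min`. It uses inverse-propensity-weighted (IPW) value estimates with a Hoeffding radius made time-uniform by a union bound over policies and times, which is valid but looser than a $\log\log$-rate confidence sequence. The names `radius`, `monitor`, and `demo` are illustrative only.

```python
import numpy as np


def radius(t, n_policies, delta, bound):
    """Hoeffding radius for IPW terms in [0, bound], made time-uniform by
    spending delta / (n_policies * t * (t + 1)) on each (policy, time) pair."""
    return bound * np.sqrt(np.log(2 * n_policies * t * (t + 1) / delta) / (2 * t))


def monitor(contexts, actions, rewards, propensities, policies,
            delta=0.05, p_min=0.1):
    """Sequentially eliminate policies; return (stop_time, surviving indices).

    Assumes rewards in [0, 1] and propensities >= p_min, so each IPW term
    lies in [0, 1 / p_min]. stop_time is None if several policies survive.
    """
    bound = 1.0 / p_min
    n = len(policies)
    sums = np.zeros(n)
    alive = np.ones(n, dtype=bool)
    stream = zip(contexts, actions, rewards, propensities)
    for t, (x, a, r, p) in enumerate(stream, start=1):
        # IPW term: reward counts only if the policy agrees with the logged action.
        match = np.array([pi(x) == a for pi in policies], dtype=float)
        sums += match * r / p
        means = sums / t
        rad = radius(t, n, delta, bound)
        # Drop any policy whose upper bound is below the best lower bound.
        best_lower = np.max(means[alive] - rad)
        alive &= means + rad >= best_lower
        if alive.sum() == 1:
            return t, np.flatnonzero(alive)
    return None, np.flatnonzero(alive)


def demo(seed=0, n=5000):
    """Synthetic example: binary contexts, two actions, uniform logging."""
    rng = np.random.default_rng(seed)
    contexts = rng.integers(0, 2, size=n)
    actions = rng.integers(0, 2, size=n)           # logging policy: uniform
    propensities = np.full(n, 0.5)
    reward_prob = np.where(actions == contexts, 0.9, 0.1)
    rewards = rng.binomial(1, reward_prob)
    policies = [
        lambda x: x,        # matches the context (true value 0.9)
        lambda x: 1 - x,    # always mismatches (true value 0.1)
        lambda x: 0,        # constant action (true value 0.5)
    ]
    return monitor(contexts, actions, rewards, propensities, policies,
                   delta=0.05, p_min=0.5)


if __name__ == "__main__":
    stop_time, survivors = demo()
    print("stopped at", stop_time, "with surviving policies", survivors)
```

In this synthetic setup the gap between the best and second-best policy values is 0.4, so the surviving set should shrink to the context-matching policy well before the data run out. Tightening the radius, for example with a variance-adaptive confidence sequence, would shorten the stopping time without changing the elimination logic.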