Standard bandit algorithms that assume continual reallocation of measurement effort are challenging to implement because of delayed feedback and infrastructural and organizational difficulties. Motivated by practical instances involving a handful of reallocation epochs in which outcomes are measured in batches, we develop a new adaptive experimentation framework that can flexibly handle any batch size. Our main observation is that the normal approximations ubiquitous in statistical inference can also guide the design of scalable adaptive experiments. By deriving an asymptotic sequential experiment, we formulate a dynamic program that can leverage prior information on average rewards. State transitions of the dynamic program are differentiable with respect to the sampling allocations, allowing the use of gradient-based methods for planning and policy optimization. We propose a simple iterative planning method, Residual Horizon Optimization, which selects sampling allocations by optimizing a planning objective via stochastic gradient-based methods. Our method significantly improves statistical power over standard adaptive policies, even when compared to Bayesian bandit algorithms (e.g., Thompson sampling) that require full distributional knowledge of individual rewards. Overall, we expand the scope of adaptive experimentation to settings that are difficult for standard adaptive policies, including problems with a small number of reallocation epochs, low signal-to-noise ratios, and unknown reward distributions.
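To make the idea of a differentiable state transition concrete, the following is a minimal sketch and not the paper's implementation. It assumes a conjugate Gaussian model for each arm: a prior N(mean, var) on the average reward, a known per-sample noise variance, and a batch sample mean that is approximately normal. The function names, the parameters, and the choice of "expected best posterior mean after one batch" as the planning objective are all illustrative assumptions.

```python
import math
import random


def posterior_update(mean, var, noise_var, batch_size, alloc, z):
    """One batch transition for a single arm (hypothetical helper).

    alloc is the fraction of the batch given to this arm, and z is a
    standard normal draw that stands in for the not-yet-observed batch mean.
    """
    # Conjugate normal update: precision grows by (samples / noise variance).
    new_var = 1.0 / (1.0 / var + batch_size * alloc / noise_var)
    # Before the batch is observed, the new posterior mean is distributed as
    # N(mean, var - new_var); clamp at zero to guard against rounding error.
    new_mean = mean + math.sqrt(max(var - new_var, 0.0)) * z
    return new_mean, new_var


def planning_value(means, variances, noise_vars, batch_size, allocs,
                   n_draws=2000, seed=0):
    """Monte Carlo estimate of the expected best posterior mean after one
    batch (an assumed, simplified planning objective)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        best = -math.inf
        for m, v, s2, p in zip(means, variances, noise_vars, allocs):
            new_mean, _ = posterior_update(m, v, s2, batch_size, p,
                                           rng.gauss(0.0, 1.0))
            best = max(best, new_mean)
        total += best
    return total / n_draws
```

With z held fixed, the transition is a smooth function of alloc for positive allocations, which is the property that lets a planner optimize allocations by gradient steps. An automatic differentiation library would be the natural replacement for this pure-Python version.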