Standard bandit algorithms that assume continual reallocation of measurement effort are challenging to implement due to delayed feedback and infrastructural/organizational difficulties. Motivated by practical instances involving a handful of reallocation epochs in which outcomes are measured in batches, we develop a computation-driven adaptive experimentation framework that can flexibly handle batching. Our main observation is that normal approximations, which are universal in statistical inference, can also guide the design of adaptive algorithms. By deriving a Gaussian sequential experiment, we formulate a dynamic program that can leverage prior information on average rewards. Instead of the typical theory-driven paradigm, we leverage computational tools and empirical benchmarking for algorithm development. In particular, our empirical analysis highlights a simple yet effective algorithm, Residual Horizon Optimization, which iteratively solves a planning problem using stochastic gradient descent. Our approach significantly improves statistical power over standard methods, even when compared to Bayesian bandit algorithms (e.g., Thompson sampling) that require full distributional knowledge of individual rewards. Overall, we expand the scope of adaptive experimentation to settings that are difficult for standard methods, involving limited adaptivity, low signal-to-noise ratio, and unknown reward distributions.
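To make the role of the normal approximation concrete, the following is a minimal sketch of the kind of state update a Gaussian sequential experiment relies on. It assumes a Gaussian prior on each arm's average reward, a known (or estimated) measurement variance per arm, and a batch whose sample mean is treated as approximately normal. The function name `gaussian_batch_update` and all parameter names are hypothetical and chosen for illustration; this is the standard conjugate Gaussian update, not the paper's Residual Horizon Optimization procedure itself.

```python
import numpy as np

def gaussian_batch_update(prior_mean, prior_var, batch_mean, noise_var,
                          allocation, batch_size):
    """Conjugate Gaussian update of beliefs about average rewards after one batch.

    Assumption (illustrative): arm a receives allocation[a] * batch_size samples,
    and its batch sample mean is approximately N(mu_a, noise_var[a] / n_a).
    """
    prior_mean = np.asarray(prior_mean, dtype=float)
    prior_var = np.asarray(prior_var, dtype=float)
    batch_mean = np.asarray(batch_mean, dtype=float)
    noise_var = np.asarray(noise_var, dtype=float)

    # Effective number of samples each arm receives in this batch.
    n = np.asarray(allocation, dtype=float) * batch_size

    prior_prec = 1.0 / prior_var   # precision of the prior belief
    data_prec = n / noise_var      # precision contributed by the batch mean

    # Precisions add; the posterior mean is a precision-weighted average.
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * batch_mean) / post_prec
    return post_mean, 1.0 / post_prec


# Two arms; the whole batch of 4 samples goes to arm 0, so arm 1 is unchanged.
post_mean, post_var = gaussian_batch_update(
    prior_mean=[0.0, 0.0], prior_var=[1.0, 1.0],
    batch_mean=[0.5, 9.9], noise_var=[1.0, 1.0],
    allocation=[1.0, 0.0], batch_size=4,
)
```

Because the update depends on the data only through batch sample means and sampling allocations, the posterior means and variances can serve as the state of a planning problem over the remaining batches, without requiring the full distribution of individual rewards.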