Standard bandit algorithms that assume continual reallocation of measurement effort are challenging to implement due to delayed feedback and infrastructural/organizational difficulties. Motivated by practical instances involving a handful of reallocation epochs in which outcomes are measured in batches, we develop a computation-driven adaptive experimentation framework that can flexibly handle batching. Our main observation is that normal approximations, which are universal in statistical inference, can also guide the design of adaptive algorithms. By deriving a Gaussian sequential experiment, we formulate a dynamic program that can leverage prior information on average rewards. Instead of the typical theory-driven paradigm, we leverage computational tools and empirical benchmarking for algorithm development. In particular, our empirical analysis highlights a simple yet effective algorithm, Residual Horizon Optimization, which iteratively solves a planning problem using stochastic gradient descent. Our approach significantly improves statistical power over standard methods, even when compared to Bayesian bandit algorithms (e.g., Thompson sampling) that require full distributional knowledge of individual rewards. Overall, we expand the scope of adaptive experimentation to settings that are difficult for standard methods, involving a small number of reallocation epochs, low signal-to-noise ratio, and unknown reward distributions.
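To make the normal-approximation idea concrete, here is a minimal sketch of one batch update in a Gaussian sequential experiment. It is not the authors' implementation. It assumes that each arm's batch sample mean is approximately normal around the arm's average reward, that the per-observation noise variances are known, and that the prior on each average reward is an independent Gaussian. The function name `gaussian_batch_update` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_batch_update(mu, var, batch_means, noise_var, batch_size, alloc):
    """One conjugate Gaussian update per arm after observing a batch.

    mu, var      : prior mean and variance of each arm's average reward
    batch_means  : sample mean of rewards observed for each arm in this batch
    noise_var    : per-observation reward variance for each arm (assumed known)
    batch_size   : total number of observations in the batch
    alloc        : fraction of the batch assigned to each arm (sums to 1)
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    noise_var = np.asarray(noise_var, dtype=float)
    n = batch_size * np.asarray(alloc, dtype=float)

    # Arms that received no samples carry no information; zero out their
    # (possibly undefined) sample means so they do not propagate NaNs.
    batch_means = np.where(n > 0, np.asarray(batch_means, dtype=float), 0.0)

    # Normal approximation: batch mean ~ N(theta, noise_var / n),
    # so the batch contributes precision n / noise_var.
    data_precision = n / noise_var
    post_precision = 1.0 / var + data_precision
    post_var = 1.0 / post_precision
    post_mu = post_var * (mu / var + data_precision * batch_means)
    return post_mu, post_var

# Example: two arms, a standard normal prior, and an evenly split batch of 10.
post_mu, post_var = gaussian_batch_update(
    mu=[0.0, 0.0], var=[1.0, 1.0],
    batch_means=[1.0, 0.0], noise_var=[1.0, 1.0],
    batch_size=10, alloc=[0.5, 0.5],
)
print(post_mu, post_var)
```

In this sketch the posterior depends on the data only through the batch means, which is why the full reward distribution is not needed. A planning method such as the one described in the abstract would choose `alloc` for each remaining epoch, and this update would then be applied once the batch is observed.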