Multi-armed bandits are widely used for sequential experimentation in clinical trials, recommendation systems, and online platforms. While regret minimization and valid inference from adaptively collected data have each been studied extensively, a basic question remains: when does adaptivity \emph{improve estimation precision} relative to uniform designs, and how should inference be balanced against the online cost of experimentation? We first study arm-level mean estimation under mean-squared-error (MSE) objectives. We characterize when an adaptive Neyman allocation, which allocates samples according to arm variance, yields strict MSE improvements over uniform sampling. When variances are heterogeneous across arms, these improvements arise at modest sample sizes, clarifying that adaptivity can be preferable for inference not only asymptotically but also in many practical finite-sample settings. We then study a joint inference-regret objective that accounts for the cost of assigning units to inferior arms during experimentation. We propose the Static-Allocation Rate Policy (SARP) and the Neyman-Adaptive Rate Policy (NARP), which interpolate between inference- and regret-oriented policies by adjusting exploration to the local structure of the instance. We show that SARP and NARP converge to the complete-information benchmark at the optimal rate as the sampling budget grows. The proposed policies are practically attractive because they linearly interpolate between any standard regret-minimizing algorithm and inference-targeting adaptive policies, yet we show that they still attain the oracle-based asymptotically optimal rate. Simulations support the theory by demonstrating improved precision over uniform allocation while controlling performance loss across a range of instances.
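To make the comparison between Neyman and uniform allocation concrete, the following is a minimal sketch rather than the paper's method. It assumes the objective is the sum of per-arm MSEs of the sample means, that the arm standard deviations are known (an oracle setting; an adaptive policy would have to estimate them), and that fractional sample counts are allowed. The function names and the numbers in the example are hypothetical. Under these assumptions the classical Neyman rule assigns samples in proportion to each arm's standard deviation, giving total MSE $(\sum_k \sigma_k)^2 / n$, compared with $K \sum_k \sigma_k^2 / n$ under uniform sampling; by the Cauchy-Schwarz inequality the former is never larger, and it is strictly smaller whenever the standard deviations differ.

```python
def uniform_allocation(n, sigmas):
    """Split a budget of n samples equally across arms (fractional counts allowed)."""
    k = len(sigmas)
    return [n / k] * k


def neyman_allocation(n, sigmas):
    """Oracle Neyman rule: samples proportional to each arm's standard deviation.

    Assumes the standard deviations are known; this is an illustration,
    not the adaptive policy analyzed in the paper.
    """
    total = sum(sigmas)
    return [n * s / total for s in sigmas]


def total_mse(allocation, sigmas):
    """Sum over arms of Var(sample mean) = sigma_k^2 / n_k."""
    return sum(s ** 2 / m for s, m in zip(sigmas, allocation))


# Hypothetical two-arm instance with heterogeneous noise levels.
sigmas = [1.0, 3.0]
n = 100

mse_uniform = total_mse(uniform_allocation(n, sigmas), sigmas)
mse_neyman = total_mse(neyman_allocation(n, sigmas), sigmas)

print(f"uniform total MSE: {mse_uniform:.4f}")
print(f"Neyman  total MSE: {mse_neyman:.4f}")
```

In this instance the Neyman rule places 25 samples on the low-variance arm and 75 on the high-variance arm. When all arms share the same standard deviation the two allocations coincide, which is consistent with the abstract's statement that gains require variance heterogeneity.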