Multi-arm bandits are gaining popularity because they enable real-world sequential decision-making across application areas such as clinical trials, recommender systems, and online decision-making. Consequently, there is growing interest in using the resulting adaptively collected datasets to determine whether one arm was more effective than another, e.g., which product or treatment performed better. Unfortunately, existing tools either fail to provide valid inference when data are collected adaptively or require many untestable, technical assumptions, e.g., stationarity, iid rewards, or bounded random variables. Our paper introduces the design-based approach to inference for multi-arm bandits, in which we condition on the full set of potential outcomes and perform inference on the obtained sample. We construct valid confidence intervals for both the mean reward of any arm and the difference in mean reward between any two arms in an assumption-light manner, allowing the rewards to be arbitrarily distributed, non-iid, and drawn from non-stationary distributions. In addition to confidence intervals, we provide valid design-based confidence sequences, i.e., sequences of confidence intervals with type-1 error guarantees that hold uniformly over time. Confidence sequences allow the agent to perform a hypothesis test as the data arrive sequentially and to stop the experiment as soon as the agent is satisfied with the inference, e.g., once the mean reward of an arm is statistically significantly higher than a desired threshold.
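To make the design-based idea concrete, the following is a minimal sketch, not the paper's exact construction. It treats the potential outcomes as fixed and uses only the randomness of the arm assignment, assuming the assignment probabilities are known at every step and bounded away from zero. The function name `ipw_mean_ci`, the inverse-probability-weighted estimator, the conservative variance estimate, and the normal approximation are all illustrative assumptions. The result is a fixed-time interval, not a confidence sequence.

```python
import math
import random


def ipw_mean_ci(arms, rewards, probs, arm, z=1.96):
    """Fixed-time interval for the average potential outcome of `arm`.

    arms[t]    : arm actually pulled at step t
    rewards[t] : observed reward at step t
    probs[t]   : known probability that `arm` would be pulled at step t
    All names and the variance formula are illustrative assumptions.
    """
    T = len(rewards)
    total = 0.0      # sum of inverse-probability-weighted rewards
    var_total = 0.0  # sum of per-step variance estimates
    for w, y, p in zip(arms, rewards, probs):
        if w == arm:
            total += y / p
            var_total += (1.0 - p) * y * y / (p * p)
    estimate = total / T
    half_width = z * math.sqrt(var_total) / T
    return estimate, estimate - half_width, estimate + half_width


if __name__ == "__main__":
    rng = random.Random(0)
    T, eps = 2000, 0.2
    # Fixed potential outcomes with drift: non-stationary by construction.
    y0 = [0.5 + 0.0002 * t for t in range(T)]
    y1 = [1.0 - 0.0001 * t for t in range(T)]
    arms, rewards, probs0 = [], [], []
    sums, counts = [0.0, 0.0], [0, 0]
    for t in range(T):
        # Epsilon-greedy assignment whose probabilities are known exactly.
        means = [sums[k] / counts[k] if counts[k] else 0.0 for k in (0, 1)]
        greedy = 0 if means[0] >= means[1] else 1
        p0 = 1 - eps / 2 if greedy == 0 else eps / 2
        w = 0 if rng.random() < p0 else 1
        y = y0[t] if w == 0 else y1[t]
        arms.append(w)
        rewards.append(y)
        probs0.append(p0)
        sums[w] += y
        counts[w] += 1
    print("estimate, lower, upper:", ipw_mean_ci(arms, rewards, probs0, arm=0))
    print("true average outcome of arm 0:", sum(y0) / T)
```

The demo draws no rewards from a distribution: the only randomness is which arm is pulled, which is the sense in which inference here is design-based. Extending this to a time-uniform confidence sequence requires a wider boundary than the fixed `z` used above.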