In many applications, such as healthcare and e-commerce, the goal of a contextual bandit may be to learn an optimal treatment assignment policy by the end of the experiment, that is, to minimize simple regret. However, this objective remains understudied. We propose a new family of computationally efficient algorithms for the stochastic contextual bandit setting, in which a tuning parameter determines the weight placed on cumulative regret minimization (for which we establish near-optimal minimax guarantees) versus simple regret minimization (for which we establish state-of-the-art guarantees). Our algorithms work with any function class, are robust to model misspecification, and can be used in continuous-arm settings. This flexibility comes from constructing and relying on "conformal arm sets" (CASs). A CAS assigns a set of arms to every context, and this set contains the context-specific optimal arm with a specified probability over the context distribution. We contrast these positive results on simple and cumulative regret with a negative result: no algorithm can achieve instance-dependent simple regret guarantees while simultaneously achieving minimax-optimal cumulative regret guarantees.
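To make the two objectives concrete, the following is a minimal sketch for a finite-arm toy setting. It is not the algorithm or the CAS construction of this paper, neither of which is specified in the abstract. The function names (`arm_set`, `cumulative_regret`, `simple_regret`) and the threshold rule used to build an arm set are assumptions chosen purely for illustration.

```python
def arm_set(est_rewards, threshold):
    """Arms whose estimated reward is within `threshold` of the best estimate.

    Hypothetical stand-in for a conformal arm set: the real construction
    calibrates the set so that it covers the optimal arm with a target
    probability, which this simple rule does not guarantee.
    """
    best = max(est_rewards)
    return [a for a, r in enumerate(est_rewards) if r >= best - threshold]


def cumulative_regret(expected_rewards, chosen_arms):
    """Total reward gap of the arms played during the experiment.

    expected_rewards[t][a] is the expected reward of arm a in round t.
    """
    return sum(max(row) - row[a] for row, a in zip(expected_rewards, chosen_arms))


def simple_regret(expected_rewards, policy_arms):
    """Average reward gap of the final learned policy over the given contexts."""
    gaps = [max(row) - row[a] for row, a in zip(expected_rewards, policy_arms)]
    return sum(gaps) / len(gaps)
```

In this sketch, cumulative regret charges for every suboptimal arm played while learning, whereas simple regret only evaluates the policy returned at the end. An experiment can therefore explore heavily, incurring large cumulative regret, and still finish with low simple regret.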