Simple regret minimization is a critical problem in learning optimal treatment assignment policies across domains such as healthcare and e-commerce, yet it remains understudied in the contextual bandit setting. We propose a new family of computationally efficient algorithms for the stochastic contextual bandit setting that can be adapted for cumulative regret minimization (with near-optimal minimax guarantees) or for simple regret minimization (with state-of-the-art guarantees). Furthermore, our algorithms adapt to model misspecification and extend to continuous-arm settings. These advantages come from constructing and relying on "conformal arm sets" (CASs): at every context, a CAS is a set of arms that contains the context-specific optimal arm with some probability over the context distribution. Our positive results on simple and cumulative regret guarantees are contrasted by a negative result, which shows that no algorithm can achieve instance-dependent simple regret guarantees while simultaneously achieving minimax-optimal cumulative regret guarantees.
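To make the idea of an arm set that covers the optimal arm concrete, here is a minimal toy sketch. It is not the construction used in the paper: the thresholding rule (keep every arm whose estimated reward is within `threshold` of the best estimate), the function names `conformal_arm_set` and `coverage`, and the toy numbers are all assumptions made for illustration. It only shows the trade-off between set size and the fraction of contexts whose optimal arm is covered.

```python
def conformal_arm_set(est_rewards, threshold):
    """Arms whose estimated reward is within `threshold` of the best estimate.

    Hypothetical rule for illustration; `est_rewards` holds one context's
    estimated reward per arm.
    """
    best = max(est_rewards)
    return {arm for arm, r in enumerate(est_rewards) if r >= best - threshold}


def coverage(arm_sets, optimal_arms):
    """Fraction of contexts whose arm set contains the true optimal arm."""
    hits = sum(1 for s, a in zip(arm_sets, optimal_arms) if a in s)
    return hits / len(optimal_arms)


# Toy data (assumed): estimated rewards for 4 contexts and 3 arms,
# plus the true optimal arm of each context, known only in simulation.
estimates = [
    [0.90, 0.50, 0.10],
    [0.40, 0.45, 0.20],
    [0.30, 0.10, 0.60],
    [0.70, 0.65, 0.10],
]
true_best = [0, 0, 2, 1]

# threshold = 0 keeps only the greedy arm; a larger threshold widens the sets.
narrow = [conformal_arm_set(row, 0.0) for row in estimates]
wide = [conformal_arm_set(row, 0.1) for row in estimates]

print(coverage(narrow, true_best), sum(len(s) for s in narrow))
print(coverage(wide, true_best), sum(len(s) for s in wide))
```

In this toy example, widening the threshold raises coverage of the optimal arm at the cost of larger sets, which is the kind of tension a probabilistic coverage guarantee has to manage.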