We introduce a novel extension of the canonical multi-armed bandit problem that adds a new strategic option: abstention. In this framework, the agent not only selects an arm at each time step but may also abstain from accepting the stochastic instantaneous reward before observing it. When the agent abstains, it either suffers a fixed regret or gains a guaranteed reward. This added layer of complexity prompts a key question: can we develop algorithms that are computationally efficient and both asymptotically and minimax optimal in this setting? We answer this question in the affirmative by designing and analyzing algorithms whose regrets match the corresponding information-theoretic lower bounds. Our results quantify the benefits of the abstention option and lay the groundwork for studying other online decision-making problems with such an option. Extensive numerical experiments validate our theoretical results, indicating that our approach not only advances theory but also has the potential to deliver significant practical benefits.
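To make the interaction protocol concrete, the following is a minimal sketch of the guaranteed-reward variant of the setting described above. It is not the algorithm analyzed in this work: the function name `run_episode`, the Gaussian unit-variance arms, the greedy placeholder policy, and the choice to let the agent still observe the sample after abstaining are all assumptions made for illustration.

```python
import random


def run_episode(means, c, horizon, seed=0):
    """Simulate a bandit with a guaranteed-reward abstention option.

    means:   hypothetical arm means (unit-variance Gaussian arms assumed).
    c:       guaranteed reward collected whenever the agent abstains.
    horizon: number of time steps.
    Returns (total collected reward, number of abstentions).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of samples observed per arm
    sums = [0.0] * k   # sum of observed samples per arm
    total_reward = 0.0
    abstentions = 0

    for t in range(horizon):
        # Placeholder policy (NOT the paper's): pull each arm once,
        # then pick the arm with the highest empirical mean.
        if t < k:
            arm = t
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i])

        # The abstention decision is made BEFORE the reward is drawn:
        # abstain when the arm's empirical mean is below c.
        estimate = sums[arm] / counts[arm] if counts[arm] else float("inf")
        abstain = estimate < c

        sample = rng.gauss(means[arm], 1.0)

        # Assumption: the sample is still observed after abstaining,
        # so it can update the agent's statistics.
        counts[arm] += 1
        sums[arm] += sample

        if abstain:
            total_reward += c   # guaranteed reward replaces the sample
            abstentions += 1
        else:
            total_reward += sample

    return total_reward, abstentions
```

For example, `run_episode([0.2, 0.5], c=0.3, horizon=1000)` returns the collected reward together with the number of rounds in which the placeholder policy abstained. The fixed-regret variant would replace the guaranteed reward with a constant per-round regret charged on abstention.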