We study the survival bandit problem, a variant of the multi-armed bandit problem with a constraint on the cumulative reward: at each time step, the agent receives a reward in [-1, 1], and if the cumulative reward falls below a preset threshold, the procedure stops, an event called ruin. To our knowledge, this is the first paper to study a framework in which ruin may occur but is not certain. We first argue that no policy can achieve sublinear regret as defined in the standard multi-armed bandit problem, because a single pull of an arm may significantly increase the risk of ruin. Instead, we establish the framework of Pareto-optimal policies, a class of policies whose cumulative reward on one instance cannot be improved without sacrificing it on another instance. To this end, we provide tight lower bounds on the probability of ruin, as well as matching policies called EXPLOIT. Finally, using a doubling trick over an EXPLOIT policy, we exhibit a Pareto-optimal policy in the case of {-1, 0, 1} rewards, answering the open problem posed by Perotto et al. (2019).
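To make the setting concrete, the following is a minimal simulation sketch of a single survival-bandit episode with {-1, 0, 1} rewards. It is not the EXPLOIT policy or any algorithm from the paper; the names `simulate_survival_bandit` and `always_arm`, the per-arm probability triples, and the choice to treat ruin as the cumulative reward becoming strictly lower than the threshold are all assumptions made for illustration.

```python
import random


def simulate_survival_bandit(arm_probs, threshold, horizon, policy, seed=0):
    """Play one episode and return (cumulative_reward, ruined, steps_played).

    arm_probs: one (p_minus, p_zero, p_plus) triple per arm, giving the
               probabilities of rewards -1, 0 and +1 (illustrative format).
    threshold: ruin level; the episode stops as soon as the cumulative
               reward is strictly lower than this value (assumed convention).
    policy:    callable (t, cumulative_reward, rng) -> arm index.
    """
    rng = random.Random(seed)
    cumulative = 0
    for t in range(horizon):
        arm = policy(t, cumulative, rng)
        # Draw a reward in {-1, 0, 1} according to the chosen arm.
        reward = rng.choices([-1, 0, 1], weights=arm_probs[arm])[0]
        cumulative += reward
        if cumulative < threshold:
            # Ruin: the procedure stops and no further reward is collected.
            return cumulative, True, t + 1
    return cumulative, False, horizon


def always_arm(index):
    """Hypothetical baseline policy that pulls the same arm at every step."""
    return lambda t, cumulative, rng: index


if __name__ == "__main__":
    arms = [(0.45, 0.0, 0.55), (0.1, 0.8, 0.1)]  # made-up instance
    runs = [
        simulate_survival_bandit(arms, threshold=-5, horizon=1000,
                                 policy=always_arm(0), seed=s)
        for s in range(200)
    ]
    ruin_rate = sum(ruined for _, ruined, _ in runs) / len(runs)
    print(f"empirical ruin frequency of arm 0: {ruin_rate:.2f}")
```

In this sketch, an arm with a higher mean reward can still ruin the agent early with positive probability, which illustrates why a single pull can matter so much in this setting.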