In a recent work, Laforgue et al. introduce the model of last switch dependent (LSD) bandits to capture nonstationary phenomena induced by the interaction between the player and the environment. Examples include satiation, where consecutive plays of the same action lead to decreased performance, and deprivation, where the payoff of an action increases after an interval of inactivity. In this work, we take a step towards understanding the approximability of planning in LSD bandits, namely, the (NP-hard) problem of computing an optimal arm-pulling strategy under complete knowledge of the model. In particular, we design the first efficient constant-factor approximation algorithm for the problem and show that, under a natural monotonicity assumption on the payoffs, its approximation guarantee (almost) matches the state of the art for the special and well-studied class of recharging bandits (also known as delay-dependent bandits). Along the way, we develop new tools and insights for this class of problems, including a novel higher-dimensional relaxation and the technique of mirroring the evolution of virtual states. We believe that these elements could be used to approach richer classes of action-induced nonstationary bandits (e.g., special instances of restless bandits). For the case where the model parameters are initially unknown, we develop an online learning adaptation of our algorithm, for which we provide sublinear regret guarantees against its full-information counterpart.