We propose a model, which we call \emph{Bandits with Deterministically Evolving States} (B-DES), for learning with bandit feedback while accounting for deterministically evolving and unobservable states. The primary applications of our model are learning for recommendation systems and learning for online ads. In both cases, the reward that the algorithm obtains at each round is a function of the short-term reward of the chosen action and of how "healthy" the system is, as measured by its state. For example, in recommendation systems, the reward that the platform obtains from a user's engagement with a particular type of content depends not only on the inherent features of that content, but also on how the user's preferences have evolved as a result of interacting with other types of content on the platform. Our general model accounts for the rate $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user's preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case. The goal of the algorithm is to minimize a notion of regret against the best fixed \emph{sequence} of arms pulled, a benchmark that is significantly harder to compete against than the standard benchmark of the best fixed action in hindsight. We present online learning algorithms for any possible value of the evolution rate $\lambda$, and we show that our results are robust to various model misspecifications.