We propose a model, which we call Bandits with Deterministically Evolving States ($B$-$DES$), for learning with bandit feedback while accounting for deterministically evolving and unobservable states. The workhorse applications of our model are learning for recommendation systems and learning for online ads. In both cases, the reward that the algorithm obtains at each round is a function of the short-term reward of the action chosen and of how "healthy" the system is (as measured by its state). For example, in recommendation systems, the reward that the platform obtains from a user's engagement with a particular type of content depends not only on the inherent features of that content, but also on how the user's preferences have evolved as a result of interacting with other types of content on the platform. Our general model accounts for the rate $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user's preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case. The goal of the algorithm is to minimize a notion of regret against the best fixed sequence of arms pulled, a benchmark that is significantly harder to attain than the standard benchmark of the best fixed action in hindsight. We present online learning algorithms for any possible value of the evolution rate $\lambda$, and we show the robustness of our results to various model misspecifications.