We propose a model for learning with bandit feedback while accounting for deterministically evolving and unobservable states, which we call Bandits with Deterministically Evolving States ($B$-$DES$). The workhorse applications of our model are learning for recommendation systems and learning for online ads. In both cases, the reward that the algorithm obtains at each round is a function of the short-term reward of the action chosen and of how "healthy" the system is, as measured by its state. For example, in recommendation systems, the reward that the platform obtains from a user's engagement with a particular type of content depends not only on the inherent features of the specific content, but also on how the user's preferences have evolved as a result of interacting with other types of content on the platform. Our general model accounts for the rate $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user's preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case. The goal of the algorithm is to minimize a notion of regret against the best fixed sequence of arms pulled, a benchmark that is significantly harder to attain than the standard best fixed action in hindsight. We present online learning algorithms for every value of the evolution rate $\lambda$ and show that our results are robust to various model misspecifications.
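To make the interaction protocol concrete, here is a minimal sketch of one *plausible* instantiation, not the paper's actual dynamics: the specific state-update rule, the arrays `base_reward` and `health_effect`, and the multiplicative coupling of payoff and state are all hypothetical placeholders, chosen only to illustrate how $\lambda$ controls the speed of the deterministic, unobserved state evolution.

```python
import random

# Hypothetical instantiation of the B-DES interaction loop (illustrative only):
# the learner pulls an arm, the unobserved state evolves deterministically at
# rate lam, and the realized reward couples the arm's short-term payoff with
# the current "health" state of the system.

K = 3      # number of arms
T = 1000   # horizon
lam = 0.1  # evolution rate lambda in [0, 1]

base_reward = [0.9, 0.6, 0.3]     # hypothetical short-term payoff of each arm
health_effect = [-0.5, 0.0, 0.4]  # hypothetical per-arm effect on the state

state = 1.0        # unobserved "health" state, initialized at full health
total_reward = 0.0

for t in range(T):
    # Placeholder policy; a real B-DES algorithm would trade off exploration,
    # exploitation, and the long-run effect of each pull on the state.
    arm = random.randrange(K)

    # Hypothetical reward model: short-term payoff scaled by system health.
    reward = base_reward[arm] * state
    total_reward += reward

    # Hypothetical deterministic state evolution: lam interpolates between a
    # frozen state (lam = 0, recovering standard multi-armed bandits) and a
    # state fully driven by the chosen arm's effect (lam = 1).
    target = max(0.0, min(1.0, state + health_effect[arm]))
    state = (1 - lam) * state + lam * target

print(f"total reward over {T} rounds: {total_reward:.1f}")
```

Under $\lambda = 0$ the state never moves and the sketch collapses to a standard multi-armed bandit, matching the special case noted above; for larger $\lambda$, the reward of each pull depends on the entire history of past pulls, which is why regret is measured against the best fixed sequence of arms rather than the best fixed action.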