We propose a model, which we call Bandits with Deterministically Evolving States, for learning with bandit feedback while accounting for states that evolve deterministically and are unobservable. The workhorse applications of our model are learning for recommendation systems and learning for online ads. In both cases, the reward that the algorithm obtains at each round is a function of the short-term reward of the action chosen and of how ``healthy'' the system is, as measured by its state. For example, in recommendation systems, the reward that the platform obtains from a user's engagement with a particular type of content depends not only on the inherent features of the specific content, but also on how the user's preferences have evolved as a result of interacting with other types of content on the platform. Our general model accounts for the rate $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user's preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case. The goal of the algorithm is to minimize a notion of regret against the best fixed sequence of arms pulled. We analyze online learning algorithms for every possible value of the evolution rate $\lambda$. Specifically, with $T$ denoting the number of rounds and $K$ the number of arms, the regret rates obtained are: for $\lambda \in [0, 1/T^2]$: $\widetilde O(\sqrt{KT})$; for $\lambda = T^{-a/b}$ with $b < a < 2b$: $\widetilde O(T^{b/a})$; for $\lambda \in (1/T, 1 - 1/\sqrt{T})$: $\widetilde O(K^{1/3}T^{2/3})$; and for $\lambda \in [1 - 1/\sqrt{T}, 1]$: $\widetilde O(K\sqrt{T})$.
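To make the four regimes easier to compare, the following minimal sketch maps a given $\lambda$, $T$, and $K$ to the order of the regret rate stated above. It is a lookup of the stated rates only, not an implementation of any of the learning algorithms. The function name `regret_rate` is hypothetical, constants and logarithmic factors hidden by $\widetilde O$ are dropped, and the second regime is evaluated by writing $\lambda = T^{-c}$ with $c = a/b \in (1, 2)$, so that $T^{b/a} = T^{1/c}$. The boundary $\lambda = 1/T$ is not covered by the stated regimes, so the sketch rejects it rather than guessing.

```python
import math


def regret_rate(lam: float, T: int, K: int) -> float:
    """Order of the stated regret rate, ignoring constants and log factors.

    Hypothetical helper: it only tabulates the four regimes quoted in the
    abstract and does not run any bandit algorithm.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if T < 4 or K < 1:
        # Assumption: for T >= 4 the four intervals below are disjoint.
        raise ValueError("need T >= 4 and K >= 1")

    if lam <= 1.0 / T**2:
        # lam in [0, 1/T^2]: sqrt(K T)
        return math.sqrt(K * T)
    if lam < 1.0 / T:
        # lam = T^{-c} with 1 < c < 2 (c = a/b): T^{b/a} = T^{1/c}
        c = -math.log(lam) / math.log(T)
        return T ** (1.0 / c)
    if lam == 1.0 / T:
        # The abstract states no rate for this boundary value.
        raise ValueError("lam = 1/T is not covered by the stated regimes")
    if lam < 1.0 - 1.0 / math.sqrt(T):
        # lam in (1/T, 1 - 1/sqrt(T)): K^{1/3} T^{2/3}
        return K ** (1.0 / 3.0) * T ** (2.0 / 3.0)
    # lam in [1 - 1/sqrt(T), 1]: K sqrt(T)
    return K * math.sqrt(T)
```

For instance, `regret_rate(0.5, 10_000, 10)` falls in the third regime and evaluates $K^{1/3}T^{2/3}$, while `regret_rate(1.0, 10_000, 10)` falls in the last regime and evaluates $K\sqrt{T}$.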