Weighted Tallying Bandits: Overcoming Intractability via Repeated Exposure Optimality

In recommender system or crowdsourcing applications of online learning, a human's preferences or abilities are often a function of the algorithm's recent actions. Motivated by this, a significant line of work has formalized settings where an action's loss is a function of the number of times that action was played in the prior $m$ timesteps, where $m$ corresponds to a bound on human memory capacity. To more faithfully capture the decay of human memory with time, we introduce the Weighted Tallying Bandit (WTB), which generalizes this setting by requiring that an action's loss is a function of a \emph{weighted} summation of the number of times that action was played in the last $m$ timesteps. The WTB setting is intractable without further assumptions, so we study it under Repeated Exposure Optimality (REO), a condition motivated by the literature on human physiology, which requires the existence of an action that, when played repeatedly, eventually yields smaller loss than any other sequence of actions. We study the minimization of complete policy regret (CPR), the strongest notion of regret, in WTB under REO. Since $m$ is typically unknown, we assume access only to an upper bound $M$ on $m$. We show that for problems with $K$ actions and horizon $T$, a simple modification of the successive elimination algorithm has $O \left( \sqrt{KT} + (m+M)K \right)$ CPR. Interestingly, up to an additive (rather than multiplicative) term of $(m+M)K$, this recovers the classical guarantee for the simpler stochastic multi-armed bandit with traditional regret. We additionally show that in our setting, any algorithm suffers additive CPR of $\Omega \left( mK + M \right)$, demonstrating that our result is nearly optimal. Our algorithm is computationally efficient, and we experimentally demonstrate its practicality and superiority over natural baselines.
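To make the weighted tally concrete, the following is a minimal Python sketch of the quantity the WTB loss depends on. It is an illustration under stated assumptions, not the paper's definition or code: the function name `weighted_tally`, the convention that `weights[0]` applies to the most recent play, and the choice to count the current play as part of the history are all hypothetical choices made here.

```python
from typing import Sequence


def weighted_tally(history: Sequence[int], arm: int, weights: Sequence[float]) -> float:
    """Weighted count of plays of `arm` among the last m = len(weights) plays.

    Assumptions (not taken from the paper): `history` lists played arms in
    chronological order and includes the current play; weights[0] applies
    to the most recent play, weights[1] to the one before it, and so on.
    """
    m = len(weights)
    # Guard m == 0, because history[-0:] would return the whole history.
    recent = history[-m:] if m > 0 else []
    # Pair the most recent play with weights[0] by walking backwards.
    return sum(w for w, a in zip(weights, reversed(recent)) if a == arm)


# Decaying weights model memory that fades with time.
decayed = weighted_tally([0, 1, 0, 0], arm=0, weights=[1.0, 0.5, 0.25])

# Unit weights reduce to the plain (unweighted) tally over the last m plays.
plain = weighted_tally([0, 1, 0, 0], arm=0, weights=[1.0, 1.0, 1.0])
```

In this sketch, an action's expected loss would be some function of `(arm, weighted_tally(...))`. Setting every weight to 1 gives the unweighted tallying setting that WTB generalizes.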
