Online Recommendations for Agents with Discounted Adaptive Preferences

We consider a bandit recommendations problem in which an agent's preferences (representing selection probabilities over recommended items) evolve as a function of past selections, according to an unknown $\textit{preference model}$. In each round, we show a menu of $k$ items (out of $n$ total) to the agent, who then chooses a single item, and we aim to minimize regret with respect to some $\textit{target set}$ (a subset of the item simplex) for adversarial losses over the agent's choices. Extending the setting of Agarwal and Brown (2022), which considered uniform-memory agents, we allow for non-uniform memory, in which a discount factor is applied to the agent's memory vector at each subsequent round. In the "long-term memory" regime (when the effective memory horizon scales sublinearly with $T$), we show that efficient sublinear regret is obtainable with respect to the set of $\textit{everywhere instantaneously realizable distributions}$ (the "EIRD set", as formulated in prior work) for any $\textit{smooth}$ preference model. Further, for preferences that are bounded above and below by linear functions of memory weight (we call these "scale-bounded" preferences), we give an algorithm that obtains efficient sublinear regret with respect to nearly the $\textit{entire}$ item simplex. We also show an NP-hardness result for expanding to targets beyond the EIRD set in general. In the "short-term memory" regime (when the memory horizon is constant), we show that scale-bounded preferences again enable efficient sublinear regret with respect to nearly the entire simplex, even without smoothness, provided losses do not change too frequently. However, we show an information-theoretic barrier to competing against the EIRD set under arbitrary smooth preference models, even when losses are constant.
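
To make the interaction protocol described in the abstract concrete, the following is a minimal simulation sketch of one plausible reading of it, not the paper's algorithm. Several details are assumptions made purely for illustration: that the memory vector is updated as `gamma * memory + e_chosen`, that preference scores are evaluated on the normalized memory vector, that menu probabilities are proportional to those scores, and that menus are drawn uniformly at random. The names `update_memory`, `choice_probabilities`, `example_preference`, and `simulate` are hypothetical. Under this geometric discounting, the total memory weight stays below $1/(1-\gamma)$, which is the usual notion of an effective memory horizon.

```python
import random


def update_memory(memory, chosen, gamma):
    """Discount every entry by gamma, then add unit weight to the chosen item.

    Assumption: this is one natural form of a discounted memory update.
    """
    updated = [gamma * w for w in memory]
    updated[chosen] += 1.0
    return updated


def normalize(weights):
    """Scale weights to sum to 1; fall back to uniform if all are zero."""
    total = sum(weights)
    if total == 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def example_preference(item, weight):
    """Hypothetical preference score: positive and linear in memory weight.

    Illustrative only; not the paper's definition of scale-bounded preferences.
    """
    return 0.5 + weight


def choice_probabilities(menu, memory, preference):
    """Selection probabilities over the menu, proportional to preference scores."""
    v = normalize(memory)
    scores = [preference(i, v[i]) for i in menu]
    total = sum(scores)
    return [s / total for s in scores]


def simulate(n, k, gamma, rounds, seed=0):
    """Run the menu/choice loop with a placeholder uniform-random recommender."""
    rng = random.Random(seed)
    memory = [0.0] * n
    history = []
    for _ in range(rounds):
        menu = rng.sample(range(n), k)  # placeholder: no learning happens here
        probs = choice_probabilities(menu, memory, example_preference)
        chosen = rng.choices(menu, weights=probs, k=1)[0]
        memory = update_memory(memory, chosen, gamma)
        history.append(chosen)
    return memory, history
```

For example, `simulate(n=6, k=3, gamma=0.9, rounds=200)` returns the final memory vector and the sequence of chosen items. Taking `gamma` close to 1 (horizon growing with the number of rounds) loosely corresponds to the "long-term memory" regime, while a fixed `gamma` corresponds to the "short-term memory" regime.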
