We consider a bandit recommendations problem in which an agent's preferences (representing selection probabilities over recommended items) evolve as a function of past selections, according to an unknown $\textit{preference model}$. In each round, we show a menu of $k$ items (out of $n$ total) to the agent, who then chooses a single item, and we aim to minimize regret with respect to some $\textit{target set}$ (a subset of the item simplex) for adversarial losses over the agent's choices. Extending the setting from Agarwal and Brown (2022), where uniform-memory agents were considered, here we allow for non-uniform memory in which a discount factor is applied to the agent's memory vector at each subsequent round. In the "long-term memory" regime (when the effective memory horizon scales with $T$ sublinearly), we show that efficient sublinear regret is obtainable with respect to the set of $\textit{everywhere instantaneously realizable distributions}$ (the "EIRD set", as formulated in prior work) for any $\textit{smooth}$ preference model. Further, for preferences which are bounded above and below by linear functions of memory weight (we call these "scale-bounded" preferences) we give an algorithm which obtains efficient sublinear regret with respect to nearly the $\textit{entire}$ item simplex. We show an NP-hardness result for expanding to targets beyond EIRD in general. In the "short-term memory" regime (when the memory horizon is constant), we show that scale-bounded preferences again enable efficient sublinear regret for nearly the entire simplex even without smoothness if losses do not change too frequently, yet we show an information-theoretic barrier for competing against the EIRD set under arbitrary smooth preference models even when losses are constant.
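As a rough illustration of the discounted-memory dynamics described above, the following minimal sketch shows one way a memory vector could be updated with a discount factor and how a menu choice might depend on it. The names `update_memory`, `effective_horizon`, and `choose_from_menu`, the unit increment for the chosen item, the $1/(1-\gamma)$ horizon, and the particular preference rule are all assumptions made for illustration. They are not the paper's definitions.

```python
import random


def update_memory(memory, chosen, gamma):
    """Discount every coordinate by gamma, then add one unit for the chosen item.

    Assumed update rule: v_t = gamma * v_{t-1} + e_{chosen}.
    """
    updated = [gamma * m for m in memory]
    updated[chosen] += 1.0
    return updated


def effective_horizon(gamma):
    """Sum of the geometric weights 1 + gamma + gamma^2 + ... (assumes 0 <= gamma < 1)."""
    return 1.0 / (1.0 - gamma)


def choose_from_menu(memory, menu, rng, floor=0.1):
    """Sample one item from the menu.

    Hypothetical preference model: each item's score is a constant floor plus
    its normalized memory weight, so scores are bounded above and below by
    linear functions of memory weight.
    """
    total = sum(memory)
    scores = [floor + (memory[i] / total if total > 0 else 0.0) for i in menu]
    return rng.choices(menu, weights=scores, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    n, gamma = 5, 0.9
    memory = [0.0] * n
    for _ in range(20):
        menu = rng.sample(range(n), 3)  # show k = 3 of n = 5 items
        item = choose_from_menu(memory, menu, rng)
        memory = update_memory(memory, item, gamma)
    print(memory, effective_horizon(gamma))
```

Under this assumed update, a discount factor closer to 1 lengthens the effective horizon, which is the quantity that separates the "long-term" and "short-term" memory regimes in the abstract.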