On typical modern platforms, users can try only a small fraction of the available items. This makes it difficult to model platform users as typical online learners, who are assumed to explore every item. To address this issue, we propose interpreting a recommender system as a bandit exploration coordinator that provides counterfactual information updates. In particular, we introduce a novel algorithm, Counterfactual UCB (CFUCB), which guarantees user exploration coordination with bounded regret in the presence of linear representations. Our results show that sharing information is a Subgame Perfect Nash Equilibrium for agents with respect to regret, so each agent achieves bounded regret. This approach has potential applications in personalized recommender systems and adaptive experimentation.