A contextual bandit is a popular framework for online learning to act under uncertainty. In practice, the number of actions is often large and their expected rewards are correlated. In this work, we introduce a general framework for capturing such correlations through a mixed-effect model, where actions are related through multiple shared effect parameters. To explore efficiently using this structure, we propose Mixed-Effect Thompson Sampling (meTS) and bound its Bayes regret. The regret bound has two terms, one for learning the action parameters and the other for learning the shared effect parameters. Both terms reflect the structure of our model and the quality of the priors. We validate our theoretical findings empirically on both synthetic and real-world problems. We also propose numerous extensions of practical interest. While they do not come with guarantees, they perform well empirically and show the generality of the proposed framework.
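To make the idea of correlated actions concrete, the following is a minimal sketch of Thompson sampling under a Gaussian mixed-effect prior. It is not the meTS algorithm as specified in the paper. It assumes a non-contextual setting, Gaussian rewards, and a known mixing-weight matrix `B`, and it integrates out the effect parameters analytically instead of sampling them hierarchically. The function name `run_mixed_effect_ts` and the parameters `sigma_psi`, `sigma_0`, and `sigma` are illustrative choices, not names from the original.

```python
import numpy as np


def run_mixed_effect_ts(B, sigma_psi=1.0, sigma_0=0.5, sigma=1.0,
                        n_rounds=500, seed=0):
    """Thompson sampling with a Gaussian mixed-effect prior (illustrative).

    Assumed model (hypothetical, for illustration only):
      psi     ~ N(0, sigma_psi^2 I)         shared effect parameters, length L
      theta_i ~ N(B[i] @ psi, sigma_0^2)    mean reward of action i
      reward  ~ N(theta_a, sigma^2)         observed after pulling action a
    """
    rng = np.random.default_rng(seed)
    K, L = B.shape

    # Draw one problem instance from the assumed prior.
    psi = rng.normal(0.0, sigma_psi, size=L)
    theta = B @ psi + rng.normal(0.0, sigma_0, size=K)

    # Integrating out psi gives a correlated Gaussian prior over theta.
    prior_mean = np.zeros(K)
    prior_cov = sigma_psi**2 * (B @ B.T) + sigma_0**2 * np.eye(K)
    prior_prec = np.linalg.inv(prior_cov)

    counts = np.zeros(K)       # number of pulls per action
    reward_sums = np.zeros(K)  # sum of observed rewards per action
    regret = 0.0

    for _ in range(n_rounds):
        # Conjugate Gaussian posterior over theta given all observations.
        post_prec = prior_prec + np.diag(counts) / sigma**2
        post_cov = np.linalg.inv(post_prec)
        post_cov = (post_cov + post_cov.T) / 2.0  # guard against asymmetry
        post_mean = post_cov @ (prior_prec @ prior_mean
                                + reward_sums / sigma**2)

        # Thompson sampling: act greedily on one posterior sample.
        sample = rng.multivariate_normal(post_mean, post_cov)
        a = int(np.argmax(sample))

        reward = theta[a] + rng.normal(0.0, sigma)
        counts[a] += 1
        reward_sums[a] += reward
        regret += theta.max() - theta[a]

    return regret, counts


if __name__ == "__main__":
    # Four actions, two shared effects; rows are per-action mixing weights.
    B = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
    regret, counts = run_mixed_effect_ts(B, n_rounds=200, seed=1)
    print("cumulative regret:", regret)
    print("pulls per action:", counts)
```

Because the prior covariance has off-diagonal entries wherever two actions share an effect, a reward observed for one action also updates the posterior of the related actions. This is the mechanism that lets structure-aware exploration scale to many actions in this simplified setting.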