Contextual bandits constitute a classical framework for decision-making under uncertainty. In this setting, the goal is to learn the arms of highest reward given contextual information, while the unknown reward parameters of each arm must be learned by experimenting with that specific arm. Accordingly, a fundamental problem is that of balancing exploration (i.e., pulling different arms to learn their parameters) against exploitation (i.e., pulling the best arms to gain reward). To study this problem, the existing literature mostly considers perfectly observed contexts. However, the setting of partial context observations remains unexplored to date, despite being theoretically more general and practically more versatile. We study bandit policies for learning to select optimal arms based on observations that are noisy linear functions of the unobserved context vectors. Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation. Specifically, we establish the following: (i) regret bounds that grow poly-logarithmically with time, (ii) square-root consistency of parameter estimation, and (iii) scaling of the regret with other quantities, including the dimensions and the number of arms. Extensive numerical experiments with both real and synthetic data are presented as well, corroborating the efficacy of Thompson sampling. To establish the results, we introduce novel martingale techniques and concentration inequalities to address partially observed dependent random variables generated from unspecified distributions, and also leverage problem-dependent information to sharpen probabilistic bounds for time-varying suboptimality gaps. These techniques pave the way for studying other decision-making problems that involve contextual information as well as partial observations.
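To make the setting concrete, the following is a minimal simulation sketch of Thompson sampling when only noisy linear observations of the contexts are available. It is an illustration under stated assumptions, not necessarily the exact policy analyzed here: the contexts and noises are taken to be Gaussian, each arm's reward is regressed directly on its observation with a Gaussian posterior, and all names (`run_thompson_sampling`, `A`, `mu`, `obs_noise`, `reward_noise`) are hypothetical.

```python
import numpy as np


def run_thompson_sampling(horizon=2000, n_arms=3, d_context=4, d_obs=4, seed=0):
    """Simulate Thompson sampling with partially observed contexts (sketch)."""
    rng = np.random.default_rng(seed)

    # Hypothetical problem instance (assumed, not taken from the paper).
    A = rng.normal(size=(d_obs, d_context))    # sensing matrix: y = A x + noise
    mu = rng.normal(size=(n_arms, d_context))  # true reward parameter of each arm
    obs_noise, reward_noise = 0.5, 0.5

    # Per-arm statistics for a ridge-style Gaussian posterior over the
    # coefficients that map an observation y to the expected reward.
    B = np.stack([np.eye(d_obs) for _ in range(n_arms)])  # precision matrices
    b = np.zeros((n_arms, d_obs))                         # sums of reward * y
    pulls = np.zeros(n_arms, dtype=int)
    total_reward = 0.0

    for _ in range(horizon):
        # Hidden contexts, one per arm; the policy never sees x.
        x = rng.normal(size=(n_arms, d_context))
        # Observed signals: noisy linear functions of the hidden contexts.
        y = x @ A.T + obs_noise * rng.normal(size=(n_arms, d_obs))

        # Thompson step: sample a parameter per arm, score its observation.
        scores = np.empty(n_arms)
        for i in range(n_arms):
            cov = np.linalg.inv(B[i])
            cov = (cov + cov.T) / 2.0  # symmetrize against round-off
            eta = rng.multivariate_normal(cov @ b[i], cov)
            scores[i] = y[i] @ eta
        arm = int(np.argmax(scores))

        # The reward depends on the hidden context of the pulled arm.
        reward = x[arm] @ mu[arm] + reward_noise * rng.normal()

        # Only the pulled arm's posterior statistics are updated.
        B[arm] += np.outer(y[arm], y[arm])
        b[arm] += reward * y[arm]
        pulls[arm] += 1
        total_reward += reward

    return pulls, total_reward
```

Calling `run_thompson_sampling(horizon=500)` returns the number of pulls per arm and the cumulative reward. The posterior sampling step is the only source of exploration in this sketch; no separate exploration schedule is used.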