We study a novel variant of the parameterized bandits problem in which the learner can observe auxiliary feedback that is correlated with the observed reward. Such auxiliary feedback is readily available in many real-life applications: for example, an online platform that wants to recommend the best-rated services to its users can observe each user's rating of a service (the reward) and also collect additional information such as service delivery time (the auxiliary feedback). In this paper, we first develop a method that exploits auxiliary feedback to build a reward estimator with tighter confidence bounds, leading to smaller regret. We then characterize the regret reduction in terms of the correlation coefficient between the reward and its auxiliary feedback. Experimental results in different settings verify the performance gain achieved by the proposed method.
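To make the link between correlation and tighter estimates concrete, the sketch below uses a classical control-variate adjustment. This is an illustrative assumption: the abstract does not say which estimator the method uses. The sketch also assumes that the mean of the auxiliary feedback is known, and the names `control_variate_mean` and `simulate` are hypothetical. Under these assumptions, the variance of the adjusted estimator shrinks by a factor of roughly 1 - rho^2, where rho is the correlation between reward and auxiliary feedback.

```python
import random
import statistics


def control_variate_mean(rewards, aux, aux_mean):
    """Estimate the mean reward, corrected by auxiliary samples.

    Assumes aux_mean (the true mean of the auxiliary feedback) is known.
    """
    n = len(rewards)
    r_bar = sum(rewards) / n
    a_bar = sum(aux) / n
    # Sample covariance and variance give the fitted coefficient beta.
    cov = sum((r - r_bar) * (a - a_bar) for r, a in zip(rewards, aux)) / (n - 1)
    var_a = sum((a - a_bar) ** 2 for a in aux) / (n - 1)
    beta = cov / var_a if var_a > 0 else 0.0
    # Subtract the part of the sampling error explained by the auxiliary data.
    return r_bar - beta * (a_bar - aux_mean)


def simulate(rho, n=50, trials=2000, seed=0):
    """Return (variance of plain mean, variance of adjusted mean) over trials."""
    rng = random.Random(seed)
    plain, adjusted = [], []
    for _ in range(trials):
        aux = [rng.gauss(0.0, 1.0) for _ in range(n)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Rewards have mean 1, unit variance, and correlation rho with aux.
        rewards = [1.0 + rho * a + (1 - rho ** 2) ** 0.5 * e
                   for a, e in zip(aux, noise)]
        plain.append(sum(rewards) / n)
        adjusted.append(control_variate_mean(rewards, aux, aux_mean=0.0))
    return statistics.variance(plain), statistics.variance(adjusted)


var_plain, var_cv = simulate(rho=0.9)
ratio = var_cv / var_plain  # expected to be near 1 - rho**2 for large n
```

A lower-variance estimator permits narrower confidence intervals around each arm's estimated reward, which is the mechanism by which a bandit algorithm could explore less and incur smaller regret. With `rho=0.0` the adjustment gives no benefit, consistent with a regret reduction that depends on the correlation coefficient.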