In this work, we study clustered contextual bandits in which rewards and resource consumption are generated by cluster-specific linear models. The arms are divided into clusters, and the cluster memberships are unknown to the algorithm. Pulling an arm in a time period yields a reward and consumes some amount of each of multiple resources; the algorithm terminates as soon as the total consumption of any resource exceeds its constraint. Thus, maximizing the total reward requires learning not only the models of reward and resource consumption, but also the cluster memberships. We provide an algorithm that achieves regret sublinear in the number of time periods without requiring access to all of the arms. In particular, we show that it suffices to perform clustering only once, on a randomly selected subset of the arms. To achieve this result, we combine techniques from the econometrics literature and from the literature on bandits with constraints.
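To make the problem setting concrete, the following is a minimal sketch of such an environment, driven by a uniformly random policy. It illustrates only the setting, not the algorithm proposed in this work. Everything specific here is an assumption made for illustration: the function name `simulate`, the parameter values, the uniform contexts shared by all arms in a round, the Gaussian noise, the clipping of consumption at zero, and the convention that the reward of the terminating round is not collected.

```python
import random


def simulate(num_arms=20, num_clusters=3, dim=4, num_resources=2,
             budget=15.0, horizon=200, seed=0):
    """Run a uniformly random policy in a toy clustered linear bandit with budgets."""
    rng = random.Random(seed)

    # Hidden cluster membership of each arm (a learner would not observe this).
    membership = [rng.randrange(num_clusters) for _ in range(num_arms)]

    # Cluster-specific linear parameters: one reward vector per cluster and
    # one consumption vector per (cluster, resource) pair.
    theta_reward = [[rng.random() for _ in range(dim)]
                    for _ in range(num_clusters)]
    theta_cons = [[[rng.random() for _ in range(dim)]
                   for _ in range(num_resources)]
                  for _ in range(num_clusters)]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    total_reward = 0.0
    consumed = [0.0] * num_resources
    rounds = 0
    for _ in range(horizon):
        # Assumed context model: entries scaled so inner products lie in [0, 1].
        context = [rng.random() / dim for _ in range(dim)]
        arm = rng.randrange(num_arms)   # random policy, for illustration only
        cluster = membership[arm]

        # Noisy linear reward and per-resource consumption of the pulled arm.
        reward = dot(theta_reward[cluster], context) + rng.gauss(0.0, 0.05)
        for j in range(num_resources):
            usage = dot(theta_cons[cluster][j], context) + rng.gauss(0.0, 0.05)
            consumed[j] += max(usage, 0.0)  # assumed: consumption is non-negative
        rounds += 1

        # Termination: the total consumption of some resource exceeds its budget.
        if any(x > budget for x in consumed):
            break
        total_reward += reward

    return total_reward, consumed, rounds


if __name__ == "__main__":
    total, consumed, rounds = simulate()
    print(f"rounds played: {rounds}, total reward: {total:.2f}")
    print("consumption per resource:", [round(x, 2) for x in consumed])
```

A learning algorithm would replace the random arm choice and would have to estimate `theta_reward`, `theta_cons`, and `membership` from the observed rewards and consumption alone.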