We consider contextual bandits with graph feedback, a class of interactive learning problems with richer structure than vanilla contextual bandits, in which taking an action reveals the rewards of all neighboring actions in the feedback graph under all contexts. Unlike the multi-armed bandit setting, where a growing literature has painted a near-complete picture of graph feedback, much remains unexplored in the contextual bandit counterpart. In this paper, we make inroads into this question by establishing a regret lower bound $\Omega(\sqrt{\beta_M(G) T})$, where $M$ is the number of contexts, $G$ is the feedback graph, and $\beta_M(G)$ is a graph-theoretic quantity that we propose to characterize the fundamental learning limit for this class of problems. Interestingly, $\beta_M(G)$ interpolates between $\alpha(G)$ (the independence number of the graph) and $\mathsf{m}(G)$ (the maximum acyclic subgraph (MAS) number of the graph) as the number of contexts $M$ varies. We also provide algorithms that achieve near-optimal regret for important classes of context sequences and/or feedback graphs, such as transitively closed graphs, which find applications in auctions and inventory control. In particular, with many contexts, our results show that the MAS number, rather than the independence number that governs multi-armed bandits, essentially characterizes the statistical complexity of contextual bandits.
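To make the gap between $\alpha(G)$ and $\mathsf{m}(G)$ concrete, the following is a minimal brute-force sketch in Python. It is not the paper's algorithm and does not compute $\beta_M(G)$, which the abstract does not define. It assumes the usual definitions for directed graphs: $\alpha(G)$ is the size of the largest vertex set with no edge between any two distinct vertices in either direction, and $\mathsf{m}(G)$ is the size of the largest vertex set whose induced subgraph has no directed cycle, ignoring self-loops. The function names and the two example graphs are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def _strip_self_loops(edges):
    # Self-loops are ignored under the definitions assumed here.
    return {(u, v) for u, v in edges if u != v}


def independence_number(n, edges):
    """Largest set of vertices with no edge between any two of them (brute force)."""
    adj = _strip_self_loops(edges)
    for k in range(n, 0, -1):
        for subset in combinations(range(n), k):
            if all((u, v) not in adj and (v, u) not in adj
                   for u, v in combinations(subset, 2)):
                return k
    return 0


def _induces_acyclic(nodes, adj):
    # Kahn's algorithm restricted to the subgraph induced by `nodes`.
    nodes = set(nodes)
    indeg = {v: 0 for v in nodes}
    for u, v in adj:
        if u in nodes and v in nodes:
            indeg[v] += 1
    stack = [v for v in nodes if indeg[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for a, b in adj:
            if a == u and b in nodes:
                indeg[b] -= 1
                if indeg[b] == 0:
                    stack.append(b)
    # Every vertex is removed exactly when there is no directed cycle.
    return removed == len(nodes)


def mas_number(n, edges):
    """Largest set of vertices inducing an acyclic subgraph (brute force)."""
    adj = _strip_self_loops(edges)
    for k in range(n, 0, -1):
        for subset in combinations(range(n), k):
            if _induces_acyclic(subset, adj):
                return k
    return 0


# Hypothetical example 1: directed 4-cycle 0 -> 1 -> 2 -> 3 -> 0.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]

# Hypothetical example 2: a total order on 4 vertices (edge i -> j whenever i < j),
# which is transitively closed.
total_order = [(i, j) for i in range(4) for j in range(i + 1, 4)]

print(independence_number(4, cycle), mas_number(4, cycle))              # 2 3
print(independence_number(4, total_order), mas_number(4, total_order))  # 1 4
```

On the total order, every pair of vertices is adjacent, so $\alpha(G) = 1$, while the whole graph is acyclic, so $\mathsf{m}(G) = 4$. This is the kind of gap between the two quantities that the abstract refers to. The enumeration is exponential in the number of vertices and is meant only for small illustrative graphs.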