We consider the problem of designing contextual bandit algorithms in the ``cross-learning'' setting of Balseiro et al., where the learner observes the loss of the action they play under every possible context, not just the context of the current round. We specifically consider the setting where losses are chosen adversarially and contexts are sampled i.i.d. from an unknown distribution. In this setting, we resolve an open problem of Balseiro et al. by providing an efficient algorithm with a nearly tight (up to logarithmic factors) regret bound of $\widetilde{O}(\sqrt{TK})$, independent of the number of contexts. As a consequence, we obtain the first nearly tight regret bounds for the problems of learning to bid in first-price auctions (under unknown value distributions) and sleeping bandits with a stochastic action set. At the core of our algorithm is a novel technique for coordinating the execution of a learning algorithm over multiple epochs in such a way as to remove correlations between the estimation of the unknown distribution and the actions played by the algorithm. This technique may be of independent interest for other learning problems involving estimation of an unknown context distribution.