We address the problem of stochastic combinatorial semi-bandits, where a player selects among $P$ subsets of a set containing $d$ base items. Most existing algorithms (e.g. CUCB, ESCB, OLS-UCB) require prior knowledge of the reward distribution, such as an upper bound on a sub-Gaussian proxy variance, which is hard to estimate tightly. In this work, we design a variance-adaptive version of OLS-UCB that relies on an online estimation of the covariance structure. Estimating the coefficients of a covariance matrix is much more manageable in practical settings and yields improved regret upper bounds compared to proxy-variance-based algorithms. When the covariance coefficients are all non-negative, we show that our approach efficiently leverages the semi-bandit feedback and provably outperforms bandit-feedback approaches, not only in exponential regimes where $P \gg d$ but also when $P \le d$, a result that does not follow directly from most existing analyses.
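To make the idea of estimating covariance coefficients online from semi-bandit feedback concrete, here is a minimal sketch. It is not the estimator analysed in this work: the class name `PairwiseCovarianceEstimator`, its methods, and the choice of a plain empirical covariance over the rounds in which two items were observed together are all assumptions made for illustration.

```python
import numpy as np


class PairwiseCovarianceEstimator:
    """Hypothetical helper: running empirical covariance for each item pair.

    Under semi-bandit feedback, the reward of every item in the chosen
    subset is observed, so a pair (i, j) gets a new sample each time
    both items are played together.
    """

    def __init__(self, d):
        self.counts = np.zeros((d, d))  # rounds in which i and j were both observed
        self.sums = np.zeros((d, d))    # sums[i, j]: sum of x_i over those rounds
        self.prods = np.zeros((d, d))   # prods[i, j]: sum of x_i * x_j over those rounds

    def update(self, action, rewards):
        """action: distinct item indices; rewards: their observed rewards, same order."""
        idx = np.asarray(action)
        x = np.asarray(rewards, dtype=float)
        block = np.ix_(idx, idx)
        self.counts[block] += 1
        self.sums[block] += x[:, None]       # row i receives x_i in every column
        self.prods[block] += np.outer(x, x)

    def estimate(self):
        """Empirical covariance per pair; NaN where a pair was never observed."""
        n = np.maximum(self.counts, 1)       # avoid division by zero
        mean_i = self.sums / n               # mean of x_i over joint rounds with j
        cov = self.prods / n - mean_i * mean_i.T
        return np.where(self.counts > 0, cov, np.nan)


# Two rounds in which items 0 and 1 are played together (d = 3).
est = PairwiseCovarianceEstimator(3)
est.update([0, 1], [1.0, 2.0])
est.update([0, 1], [3.0, 6.0])
cov = est.estimate()  # cov[0, 1] is 2.0; item 2 was never observed, so cov[2, 2] is NaN
```

Each coefficient is estimated only from rounds that revealed both items, which is the information semi-bandit feedback provides and bandit feedback (the total reward alone) does not. A variance-adaptive index would additionally need confidence terms around these estimates, which this sketch omits.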