We address the problem of stochastic combinatorial semi-bandits, where a player selects among P actions from the power set of a set containing d base items. Adaptivity to the problem's structure is essential for obtaining optimal regret upper bounds. Since estimating the coefficients of a covariance matrix can be manageable in practice, leveraging them should improve the regret. We design "optimistic" covariance-adaptive algorithms that rely on online estimates of the covariance structure, called OLS-UCB-C and COS-V (the latter uses only the variances). Both yield improved gap-free regret bounds. Although COS-V can be slightly suboptimal, it reduces computational complexity by taking inspiration from Thompson Sampling approaches. It is the first sampling-based algorithm achieving a T^(1/2) gap-free regret bound (up to poly-logarithmic factors). We also show that in some cases our approach efficiently leverages the semi-bandit feedback and outperforms bandit-feedback approaches, not only in exponential regimes where P >> d but also when P <= d, which is not covered by existing analyses.
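To make the setting concrete, the following is a minimal sketch of the semi-bandit interaction loop with a generic per-item optimistic index. It is not OLS-UCB-C or COS-V: the variance-dependent bonus is a standard empirical-Bernstein-style term chosen for illustration, and the function name `run_semi_bandit`, the Bernoulli environment, the constants in the bonus, and the brute-force maximisation over the P actions are all assumptions of this sketch.

```python
import math
import random


def run_semi_bandit(actions, means, horizon, seed=0):
    """Simulate a combinatorial semi-bandit with a per-item optimistic index.

    actions: list of tuples of base-item indices (subsets of range(d)).
    means:   true Bernoulli mean of each base item (hypothetical environment).
    Returns the cumulative pseudo-regret after `horizon` rounds.
    """
    rng = random.Random(seed)
    d = len(means)
    counts = [0] * d      # number of observations per base item
    sums = [0.0] * d      # sum of observed outcomes per item
    sq_sums = [0.0] * d   # sum of squared outcomes per item

    best_value = max(sum(means[i] for i in a) for a in actions)
    regret = 0.0

    def index(i, t):
        # Items never observed get an infinite index to force exploration.
        if counts[i] == 0:
            return math.inf
        n = counts[i]
        mean = sums[i] / n
        var = max(sq_sums[i] / n - mean * mean, 0.0)
        log_t = math.log(t + 1)
        # Illustrative variance-dependent bonus (assumed constants):
        # it shrinks faster for items whose empirical variance is low.
        return mean + math.sqrt(2.0 * var * log_t / n) + 3.0 * log_t / n

    for t in range(1, horizon + 1):
        # Optimistic choice: maximise the sum of item indices over all actions.
        chosen = max(actions, key=lambda a: sum(index(i, t) for i in a))
        # Semi-bandit feedback: the outcome of every chosen item is observed.
        for i in chosen:
            x = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            sums[i] += x
            sq_sums[i] += x * x
        regret += best_value - sum(means[i] for i in chosen)
    return regret
```

For example, with d = 4 base items and the P = 6 actions given by all pairs of items, `run_semi_bandit(list(itertools.combinations(range(4), 2)), [0.9, 0.8, 0.2, 0.1], 500)` returns the pseudo-regret accumulated over 500 rounds. The sketch only tracks per-item variances; a covariance-adaptive method would additionally estimate the off-diagonal coefficients from the jointly observed items.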