We study the combinatorial semi-bandit problem, in which an agent selects a subset of base arms in each round and observes feedback for each selected arm. While this setting generalizes the classical multi-armed bandit and has broad applicability, its scalability is limited by the high cost of combinatorial optimization, since existing approaches query an optimization oracle at every round. To address this, we propose oracle-efficient frameworks that significantly reduce the number of oracle calls while maintaining tight regret guarantees. For the worst-case linear reward setting, our algorithms achieve $\tilde{O}(\sqrt{T})$ regret using only $O(\log\log T)$ oracle queries. We also propose covariance-adaptive algorithms that exploit noise structure for improved regret, and we extend our approach to general (non-linear) rewards. Overall, our methods reduce oracle usage from linear to (doubly) logarithmic in the time horizon $T$, with strong theoretical guarantees.
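To give a feel for how $O(\log\log T)$ oracle queries can cover a horizon of $T$ rounds, the sketch below builds a batch schedule of the form $t_k = \lceil T^{1-2^{-k}} \rceil$, a grid that is common in the batched-bandit literature. It is an illustration under assumptions, not the algorithm from this work: the function name `batch_end_points` and the choice of grid are hypothetical, and the abstract does not state which schedule the authors actually use. The idea is that a learner would call the oracle once per batch and reuse the returned action until the batch ends.

```python
import math


def batch_end_points(horizon):
    """Return increasing batch end rounds; one oracle call per batch (hypothetical schedule)."""
    if horizon < 2:
        raise ValueError("horizon must be at least 2")
    # Number of batches grows like log2(log2(T)), i.e. doubly logarithmically in T.
    num_batches = math.ceil(math.log2(math.log2(horizon))) + 1
    # Grid t_k = ceil(T^(1 - 2^-k)); an assumed choice, borrowed from batched bandits.
    ends = {
        min(horizon, math.ceil(horizon ** (1.0 - 2.0 ** -k)))
        for k in range(1, num_batches)
    }
    ends.add(horizon)  # the final batch always ends at the horizon
    return sorted(ends)


if __name__ == "__main__":
    for T in (10**3, 10**6, 10**9):
        grid = batch_end_points(T)
        print(f"T={T}: {len(grid)} oracle calls, batch ends {grid}")
```

Under this schedule the number of oracle calls is at most $\lceil \log_2 \log_2 T \rceil + 1$, compared with $T$ calls when the oracle is queried in every round.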