We investigate stochastic combinatorial multi-armed bandit (CMAB) problems in which the learner has access only to bandit feedback and the reward function may be non-linear. We provide a general framework for adapting discrete offline approximation algorithms into sublinear $\alpha$-regret methods that require only bandit feedback, achieving $\mathcal{O}\left(T^{\frac{2}{3}}\log(T)^{\frac{1}{3}}\right)$ expected cumulative $\alpha$-regret as a function of the horizon $T$. The framework requires only that the offline algorithm be robust to small errors in function evaluation. The adaptation procedure does not even require explicit knowledge of the offline approximation algorithm: it can be used as a black-box subroutine. To demonstrate the framework's utility, we apply it to multiple problems in submodular maximization, adapting approximation algorithms for cardinality and for knapsack constraints. In experiments with real-world data, the new CMAB algorithms for knapsack constraints outperform a full-bandit method developed for the adversarial setting.
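To make the black-box idea concrete, the following is a minimal sketch, not the paper's algorithm. It assumes an explore-then-commit style wrapper: every function evaluation the offline algorithm requests is answered with the empirical mean of `m` noisy plays, and the set the algorithm returns is then played for the remaining rounds. The names `greedy_cardinality`, `explore_then_commit`, `make_coverage_env`, the parameter `m`, and the Bernoulli coverage environment are all hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, FrozenSet, List, Sequence, Tuple

Action = FrozenSet[int]
Oracle = Callable[[Action], float]


def greedy_cardinality(oracle: Oracle, n_arms: int, k: int) -> Action:
    """Offline greedy under a cardinality constraint; it sees only `oracle`."""
    chosen: set = set()
    for _ in range(k):
        candidates = [e for e in range(n_arms) if e not in chosen]
        # One oracle call per candidate; under the wrapper each call is noisy.
        best = max(candidates, key=lambda e: oracle(frozenset(chosen | {e})))
        chosen.add(best)
    return frozenset(chosen)


def explore_then_commit(
    play: Callable[[Action], float],
    offline_alg: Callable[[Oracle], Action],
    horizon: int,
    m: int,
) -> Tuple[Action, List[float]]:
    """Run `offline_alg` as a black box on estimated values, then commit."""
    rewards: List[float] = []

    def estimated_value(action: Action) -> float:
        # Bandit feedback only: play the action m times and average.
        samples = [play(action) for _ in range(m)]
        rewards.extend(samples)
        return sum(samples) / m

    committed = offline_alg(estimated_value)
    # Exploitation phase: play the returned set for the remaining rounds.
    while len(rewards) < horizon:
        rewards.append(play(committed))
    return committed, rewards[:horizon]


def make_coverage_env(probs: Sequence[float], seed: int) -> Callable[[Action], float]:
    """Hypothetical environment: reward is 1 if any chosen arm 'fires'.

    The expected reward 1 - prod(1 - p_e) is monotone submodular and non-linear.
    """
    rng = random.Random(seed)

    def play(action: Action) -> float:
        fired = [rng.random() < probs[e] for e in action]
        return 1.0 if any(fired) else 0.0

    return play


probs = [0.05, 0.1, 0.6, 0.1, 0.5, 0.05]
play = make_coverage_env(probs, seed=0)
committed, rewards = explore_then_commit(
    play,
    lambda oracle: greedy_cardinality(oracle, n_arms=len(probs), k=2),
    horizon=20000,
    m=1000,
)
print(sorted(committed), sum(rewards) / len(rewards))
```

The wrapper never inspects `greedy_cardinality`; it only answers its value queries. This is why robustness of the offline algorithm to small evaluation errors is the property that matters. How `m` should scale with $T$ is left open in this sketch.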