We study stochastic combinatorial multi-armed bandits (CMAB) in which the learner receives only bandit feedback and the reward function may be non-linear. We provide a general framework for adapting discrete offline approximation algorithms into sublinear $\alpha$-regret methods that require only bandit feedback, achieving $\mathcal{O}\left(T^{\frac{2}{3}}\log(T)^{\frac{1}{3}}\right)$ expected cumulative $\alpha$-regret as a function of the horizon $T$. The framework requires only that the offline algorithm be robust to small errors in function evaluation. The adaptation procedure does not even require explicit knowledge of the offline approximation algorithm: it can be used as a black-box subroutine. To demonstrate the framework's utility, we apply it to diverse applications in submodular maximization. In experiments with real-world data, the resulting CMAB algorithms for submodular maximization with knapsack constraints outperform a full-bandit method developed for the adversarial setting.
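To make the black-box idea concrete, the following is a minimal sketch of one natural way such an adaptation could work, assuming an explore-then-commit scheme: each value query issued by the offline algorithm is answered by playing the queried action `m` times and averaging the noisy rewards, and the action the offline algorithm returns is then played for the remaining rounds. The abstract does not spell out the procedure, so the scheme itself, the names `explore_then_commit`, `greedy_cardinality`, and `coverage`, the toy coverage instance, and the choice of `m` are all illustrative assumptions rather than the authors' method.

```python
import random
from typing import Callable, Dict, FrozenSet, Iterable, Set, Tuple

Action = FrozenSet[int]
ValueOracle = Callable[[Action], float]


def greedy_cardinality(value: ValueOracle, ground: Iterable[int], k: int) -> Action:
    """Offline greedy under a cardinality constraint (illustrative stand-in).

    It touches the objective only through `value`, so it can be handed
    noisy estimates instead of exact function values.
    """
    items = list(ground)
    chosen: Set[int] = set()
    for _ in range(k):
        candidates = [e for e in items if e not in chosen]
        best = max(candidates, key=lambda e: value(frozenset(chosen | {e})))
        chosen.add(best)
    return frozenset(chosen)


def explore_then_commit(
    play: Callable[[Action], float],
    offline_alg: Callable[[ValueOracle], Action],
    horizon: int,
    m: int,
) -> Tuple[Action, float]:
    """Hypothetical wrapper: answer each oracle query with an m-sample mean,
    then commit to the offline algorithm's output until the horizon."""
    rounds_used = 0
    total_reward = 0.0

    def estimated_value(action: Action) -> float:
        nonlocal rounds_used, total_reward
        if rounds_used + m > horizon:
            raise ValueError("exploration budget exceeds the horizon")
        rewards = [play(action) for _ in range(m)]
        rounds_used += m
        total_reward += sum(rewards)
        return sum(rewards) / m

    # The offline algorithm is treated as a black box: it only sees estimates.
    committed = offline_alg(estimated_value)

    # Exploitation phase: play the committed action for all remaining rounds.
    while rounds_used < horizon:
        total_reward += play(committed)
        rounds_used += 1
    return committed, total_reward


# Toy instance (assumed for illustration): normalized set coverage.
COVER: Dict[int, Set[int]] = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {1}}


def coverage(action: Action) -> float:
    covered: Set[int] = set()
    for e in action:
        covered |= COVER[e]
    return len(covered) / 7.0


rng = random.Random(0)


def noisy_play(action: Action) -> float:
    # Bandit feedback: a single noisy scalar reward for the whole action.
    return coverage(action) + rng.gauss(0.0, 0.1)


chosen, reward = explore_then_commit(
    noisy_play,
    lambda oracle: greedy_cardinality(oracle, COVER.keys(), k=2),
    horizon=2000,
    m=50,
)
print(sorted(chosen), round(reward, 1))
```

In a scheme like this, `m` trades exploration cost against estimation error: larger values give the offline algorithm more accurate evaluations but leave fewer rounds for exploiting its output. Balancing the two is the usual source of $T^{2/3}$-type rates in explore-then-commit analyses.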