We consider the bandit problem of selecting $K$ out of $N$ arms at each time step. The joint reward can be a non-linear function of the rewards of the selected individual arms. Directly applying a multi-armed bandit algorithm requires choosing among $\binom{N}{K}$ options, which makes the action space large. To simplify the problem, existing works on combinatorial bandits typically assume that the feedback is a linear function of the individual rewards. In this paper, we prove a lower bound for top-$K$ subset selection with bandit feedback and possibly correlated rewards. We present a novel algorithm for the combinatorial setting that neither uses individual arm feedback nor requires linearity of the reward function. Additionally, our algorithm handles correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in $N$. Further, DART achieves a regret bound of $\tilde{\mathcal{O}}(K\sqrt{KNT})$ for a time horizon $T$, which matches the lower bound in bandit feedback up to a factor of $\sqrt{\log{2NT}}$. When applied to the problem of cross-selling optimization and maximizing the mean of individual rewards, the proposed algorithm outperforms state-of-the-art algorithms. We also show that DART significantly outperforms existing methods for both linear and non-linear joint reward environments.
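To make the accept/reject idea concrete, the following is a minimal sketch of one elimination pass driven by confidence bounds. It is not DART's actual procedure, which the abstract does not specify. It assumes that per-arm mean estimates `mu_hat` and a shared confidence radius `radius` are already available, whereas DART must work from joint feedback alone. The names `accept_reject_step`, `mu_hat`, and `radius` are hypothetical. The `math.comb` call only illustrates how quickly the naive action space of $\binom{N}{K}$ subsets grows.

```python
import math


def accept_reject_step(mu_hat, radius, k):
    """One hypothetical accept/reject pass over the undecided arms.

    mu_hat: dict mapping arm id -> estimated mean reward (assumed given)
    radius: shared confidence radius, radius >= 0 (assumed given)
    k:      number of slots still to be filled
    Returns (accepted, rejected) as two sets of arm ids.
    """
    ranked = sorted(mu_hat.values(), reverse=True)
    if k <= 0 or k >= len(ranked):
        return set(), set()  # nothing left to separate
    kth = ranked[k - 1]    # k-th largest estimate
    next_best = ranked[k]  # (k+1)-th largest estimate
    # Accept an arm when its lower bound clears the upper bound of the
    # best arm that would be left out of the top k.
    accepted = {a for a, m in mu_hat.items() if m - radius > next_best + radius}
    # Reject an arm when its upper bound falls below the lower bound of
    # the weakest arm currently inside the top k.
    rejected = {a for a, m in mu_hat.items() if m + radius < kth - radius}
    return accepted, rejected


# Size of the naive action space for N = 20 arms and K = 5 selections.
n_subsets = math.comb(20, 5)

estimates = {"a": 0.9, "b": 0.5, "c": 0.45, "d": 0.1}
accepted, rejected = accept_reject_step(estimates, radius=0.1, k=2)
```

In this toy run, arm `a` is accepted and arm `d` is rejected, while `b` and `c` remain undecided because their confidence intervals overlap. A full algorithm would keep sampling the undecided arms and shrink the radius until $K$ arms are accepted.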