This paper investigates combinatorial multi-armed bandits with stochastic rewards that are submodular in expectation and with full-bandit delayed feedback, where the delayed feedback is assumed to be composite and anonymous. In other words, the feedback observed in each round is an aggregate of reward components from past actions, and the division of that aggregate among the individual components is unknown. Three models of delayed feedback are studied: bounded adversarial, stochastic independent, and stochastic conditionally independent, and regret bounds are derived for each of them. Ignoring problem-dependent parameters, we show that the regret bound for all three delay models is $\tilde{O}(T^{2/3} + T^{1/3} \nu)$ for time horizon $T$, where $\nu$ is a delay parameter defined differently in each of the three cases, thus demonstrating that delay contributes an additive term to the regret in all three models. The considered algorithm is demonstrated to outperform other full-bandit approaches with delayed composite anonymous feedback.
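To make the composite anonymous feedback described above concrete, the following is a minimal simulation sketch and not the paper's algorithm. It assumes a delay bounded by a hypothetical `max_delay` and splits each round's reward at random over the current and following rounds. The random split is only a stand-in for the unknown division, and the names `simulate_composite_feedback`, `rewards`, and `max_delay` are illustrative assumptions.

```python
import random


def simulate_composite_feedback(rewards, max_delay, seed=0):
    """Return the aggregated feedback a learner would observe per round.

    Assumption for illustration: the reward earned in round t is split into
    max_delay + 1 nonnegative parts that arrive in rounds t .. t + max_delay.
    The learner sees only the per-round sums, not which action each part
    came from (anonymous) nor how each reward was divided (composite).
    """
    rng = random.Random(seed)
    horizon = len(rewards)
    # Extra slots hold the parts that arrive after the last action.
    observed = [0.0] * (horizon + max_delay)
    for t, reward in enumerate(rewards):
        # Random positive weights, normalized to sum to 1 (the hidden split).
        weights = [rng.random() + 1e-12 for _ in range(max_delay + 1)]
        total = sum(weights)
        for d, w in enumerate(weights):
            observed[t + d] += reward * w / total
    return observed


if __name__ == "__main__":
    true_rewards = [1.0, 0.5, 0.0, 0.25]
    feedback = simulate_composite_feedback(true_rewards, max_delay=2)
    # Total reward is preserved, but its attribution to rounds is lost.
    print(sum(true_rewards), round(sum(feedback), 6))
```

In this sketch the total reward is conserved while the per-round observations no longer identify the action that produced them, which is the difficulty the delay parameter $\nu$ accounts for in the regret bound.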