We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in three important settings. On the one hand, we derive the first optimal (up to logarithmic factors) regret bounds for combinatorial semi-bandits with delay and adversarial Markov decision processes with delay (and known transition functions). On the other hand, we use our analysis to derive an efficient algorithm for linear bandits with delay achieving near-optimal regret bounds. Our novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer.
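As a rough illustration of the setting, FTRL with delayed bandit feedback can be sketched for the simplest case: a K-armed bandit with the negative-entropy regularizer, where FTRL reduces to exponential weights over importance-weighted loss estimates. This is not the algorithm analyzed in the paper. The function name `delayed_ftrl_exp3`, the fixed learning rate `eta`, and the convention that feedback from round t becomes usable at round t + d_t + 1 are all assumptions made for this sketch.

```python
import numpy as np

def delayed_ftrl_exp3(losses, delays, eta, seed=0):
    """Illustrative FTRL (negative entropy) for K-armed bandits with delays.

    losses: array of shape (T, K) with entries in [0, 1]
    delays: length-T sequence of non-negative integer delays d_t
    eta:    fixed learning rate (an assumption; tuned rates are not shown)
    Returns the total loss incurred by the learner.
    """
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    cum_est = np.zeros(K)   # sum of loss estimates that have already arrived
    pending = {}            # arrival round -> list of (arm, loss estimate)
    total = 0.0
    for t in range(T):
        # Incorporate every estimate whose delay has elapsed by round t.
        for arm, est in pending.pop(t, []):
            cum_est[arm] += est
        # FTRL with negative entropy on the simplex has a closed form:
        # p is proportional to exp(-eta * cumulative observed estimates).
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        arm = int(rng.choice(K, p=p))
        loss = float(losses[t, arm])
        total += loss
        # Bandit feedback: only the played arm's loss is seen, so build an
        # importance-weighted estimate and queue it until it arrives.
        est = loss / p[arm]
        pending.setdefault(t + int(delays[t]) + 1, []).append((arm, est))
    return total
```

In this sketch the delay only affects when an estimate enters `cum_est`, while the bandit aspect only affects how the estimate is built, which mirrors in a loose way the separation of the two costs described above.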