The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement of immediate rewards is unmet in many real-world applications, where rewards are almost always delayed. We study delayed rewards in generalised linear bandits from a theoretical perspective. We show that a natural adaptation of an optimistic algorithm to delayed feedback achieves a regret bound in which the penalty for the delays is independent of the horizon. This result significantly improves upon existing work, where the best known regret bound has a delay penalty that increases with the horizon. We verify our theoretical results through experiments on simulated data.