We consider a sequential stochastic multi-armed bandit problem in which the agent interacts with the bandit over multiple episodes. The reward distributions of the arms remain constant within an episode but may change from one episode to the next. We propose a UCB-based algorithm that transfers reward samples from previous episodes to reduce the cumulative regret over all episodes. We provide a regret analysis and empirical results for our algorithm, which show a significant improvement over the standard UCB algorithm without transfer.
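To make the episodic setting concrete, the following is a minimal sketch of the no-transfer baseline only: standard UCB1 restarted from scratch in every episode. It is not the transfer algorithm proposed here. The Bernoulli rewards, the exploration bonus sqrt(2 ln t / n), and the names `ucb1_episode` and `run_episodes` are assumptions made for illustration.

```python
import math
import random


def ucb1_episode(means, horizon, rng):
    """Run UCB1 on Bernoulli arms for one episode.

    `means` are the (assumed) true arm means for this episode; they are
    used only to draw rewards and to compute pseudo-regret.
    Returns (cumulative pseudo-regret, pull counts per arm).
    """
    k = len(means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # sum of observed rewards per arm
    best = max(means)
    regret = 0.0

    def index(arm, t):
        # Empirical mean plus the UCB1 exploration bonus.
        return sums[arm] / counts[arm] + math.sqrt(2.0 * math.log(t) / counts[arm])

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull every arm once
        else:
            arm = max(range(k), key=lambda a: index(a, t))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret, counts


def run_episodes(episode_means, horizon, seed=0):
    """Baseline without transfer: statistics are reset at each episode."""
    rng = random.Random(seed)
    total = 0.0
    for means in episode_means:
        regret, _ = ucb1_episode(means, horizon, rng)
        total += regret
    return total
```

For example, `run_episodes([[0.9, 0.1], [0.8, 0.2]], horizon=2000)` returns the total pseudo-regret over two episodes whose arm means differ slightly. A transfer method would carry information from earlier episodes into later ones instead of discarding `counts` and `sums` at each reset.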