We propose the Bayesian-Optimistic Frequentist Upper Confidence Bound (BOF-UCB) algorithm for stochastic contextual linear bandits in non-stationary environments. By combining Bayesian and frequentist principles, BOF-UCB adapts to changing environments. It uses sequential Bayesian updates to infer the posterior distribution of the unknown regression parameter, then takes a frequentist approach to compute the Upper Confidence Bound (UCB) by maximizing the expected reward over the posterior distribution. We provide theoretical guarantees on BOF-UCB's performance and demonstrate that it balances exploration and exploitation effectively on synthetic datasets and on classical control tasks in a reinforcement learning setting. In these experiments BOF-UCB outperforms existing methods, making it a promising approach to sequential decision-making in non-stationary environments.
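To make the two-stage idea concrete (a sequential Bayesian update of the regression parameter, followed by an optimistic arm choice), the following is a minimal sketch and not the BOF-UCB algorithm itself. It assumes a Gaussian prior with precision `lam`, unit-variance Gaussian noise, an exponential forgetting factor `gamma` to handle non-stationarity, and a fixed exploration width `beta`. All of these names, the class `DiscountedBayesLinUCB`, and the helper `run_demo` are hypothetical choices made for illustration. The abstract does not specify the form of the confidence bound, so a standard mean-plus-scaled-standard-deviation bonus is swapped in here.

```python
import math
import random


def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


class DiscountedBayesLinUCB:
    """Illustrative discounted Bayesian linear regression with a UCB rule.

    Assumptions (not taken from the paper): prior N(0, I / lam), unit noise
    variance, forgetting factor gamma, fixed exploration width beta.
    """

    def __init__(self, dim, gamma=0.95, lam=1.0, beta=1.0):
        self.dim, self.gamma, self.lam, self.beta = dim, gamma, lam, beta
        # Posterior precision starts at the prior precision lam * I.
        self.precision = [[lam if i == j else 0.0 for j in range(dim)]
                          for i in range(dim)]
        self.b = [0.0] * dim  # discounted sum of reward * context

    def posterior_mean(self):
        return solve(self.precision, self.b)

    def ucb(self, x):
        mean = self.posterior_mean()
        cov_x = solve(self.precision, x)  # posterior covariance times x
        var = sum(xi * ci for xi, ci in zip(x, cov_x))
        est = sum(xi * mi for xi, mi in zip(x, mean))
        return est + self.beta * math.sqrt(max(var, 0.0))

    def select(self, arms):
        return max(range(len(arms)), key=lambda i: self.ucb(arms[i]))

    def update(self, x, reward):
        # Discount old evidence, pull the precision back toward the prior,
        # then add the new observation (with gamma = 1 this is the exact
        # Bayesian linear regression update).
        g = self.gamma
        for i in range(self.dim):
            for j in range(self.dim):
                prior = (1.0 - g) * self.lam if i == j else 0.0
                self.precision[i][j] = (g * self.precision[i][j]
                                        + prior + x[i] * x[j])
            self.b[i] = g * self.b[i] + reward * x[i]


def run_demo(horizon=200, change_point=100, noise_std=0.0, seed=0):
    """Two-armed toy problem whose best arm switches at change_point."""
    rng = random.Random(seed)
    arms = [[1.0, 0.0], [0.0, 1.0]]
    agent = DiscountedBayesLinUCB(dim=2)
    choices = []
    for t in range(horizon):
        theta = [1.0, 0.0] if t < change_point else [0.0, 1.0]
        k = agent.select(arms)
        reward = sum(a * b for a, b in zip(theta, arms[k]))
        reward += rng.gauss(0.0, noise_std)
        agent.update(arms[k], reward)
        choices.append(k)
    return agent, choices
```

Calling `run_demo()` returns the agent and the sequence of chosen arms. In this toy setting the forgetting factor lets the estimate for the first arm decay after the change point, so the agent can switch arms. With `gamma=1.0` old evidence is never discounted, so the sketch adapts to the change far more slowly.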