Although non-stationary bandit learning has attracted much recent attention, we have yet to identify a formal definition of non-stationarity that can consistently distinguish non-stationary bandits from stationary ones. Prior work has characterized non-stationary bandits as bandits for which the reward distribution changes over time. We demonstrate that this definition can ambiguously classify the same bandit as both stationary and non-stationary; this ambiguity arises from the existing definition's dependence on the latent sequence of reward distributions. Moreover, the definition has given rise to two widely used notions of regret: the dynamic regret and the weak regret. In some bandits, these notions are not indicative of qualitative agent performance. Additionally, this definition of non-stationary bandits has led to the design of agents that explore excessively. We introduce a formal definition of non-stationary bandits that resolves these issues. Our new definition provides a unified approach that applies to both Bayesian and frequentist formulations of bandits. Furthermore, it consistently classifies any two bandits that offer agents indistinguishable experiences, categorizing them as either both stationary or both non-stationary. This provides a more robust framework for non-stationary bandit learning.