Multi-armed bandit (MAB) problems are mainly studied under two extreme settings, stochastic and adversarial. These two settings, however, do not capture realistic environments such as search engines, marketing, and advertising, in which rewards change stochastically over time. Motivated by this, we introduce and study a dynamic MAB problem with stochastic temporal structure, where the expected reward of each arm is governed by an auto-regressive (AR) model. Because the rewards are dynamic, simple "explore and commit" policies fail: all arms have to be explored continuously over time. We formalize this by characterizing a per-round regret lower bound, where the regret is measured against a strong (dynamic) benchmark. We then present an algorithm whose per-round regret nearly matches this lower bound. Our algorithm relies on two mechanisms: (i) alternating between recently pulled arms and unpulled arms with potential, and (ii) restarting. These mechanisms enable the algorithm to adapt dynamically to changes and to discard irrelevant past information at a suitable rate. In numerical studies, we further demonstrate the strength of our algorithm in non-stationary settings.
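To make the setting concrete, the following is a minimal Python sketch of AR-driven expected rewards and of per-round regret measured against a dynamic benchmark (the best arm in each round). It is an illustration under stated assumptions, not the algorithm proposed here: the AR(1) recursion, the parameters `gamma` and `sigma`, the noise-free feedback, and all function names are hypothetical choices made for this example.

```python
import random


def simulate_ar1_means(num_arms, horizon, gamma=0.9, sigma=0.1, seed=0):
    """Return means[t][i], the expected reward of arm i in round t.

    Assumed AR(1) form (an illustrative choice, not taken from the source):
        means[t+1][i] = gamma * means[t][i] + noise,  noise ~ N(0, sigma^2)
    """
    rng = random.Random(seed)
    means = [[rng.gauss(0.0, sigma) for _ in range(num_arms)]]
    for _ in range(horizon - 1):
        prev = means[-1]
        means.append([gamma * m + rng.gauss(0.0, sigma) for m in prev])
    return means


def per_round_dynamic_regret(means, pulls):
    """Average gap between the best arm of each round and the pulled arm."""
    total = 0.0
    for row, arm in zip(means, pulls):
        total += max(row) - row[arm]
    return total / len(pulls)


def explore_then_commit(means, explore_rounds):
    """Naive baseline: round-robin exploration, then commit to one arm.

    Simplification: the policy observes the expected reward directly,
    without observation noise.
    """
    num_arms = len(means[0])
    sums = [0.0] * num_arms
    counts = [0] * num_arms
    pulls = []
    for t in range(explore_rounds):
        arm = t % num_arms
        sums[arm] += means[t][arm]
        counts[arm] += 1
        pulls.append(arm)

    def average(i):
        # Arms never explored are ranked last.
        return sums[i] / counts[i] if counts[i] else float("-inf")

    best = max(range(num_arms), key=average)
    pulls.extend([best] * (len(means) - explore_rounds))
    return pulls


if __name__ == "__main__":
    means = simulate_ar1_means(num_arms=3, horizon=1000, seed=1)
    pulls = explore_then_commit(means, explore_rounds=30)
    print("per-round dynamic regret:", per_round_dynamic_regret(means, pulls))
```

Because the best arm keeps changing under the AR dynamics, a policy that stops exploring after a fixed number of rounds has no way to track it, which is the intuition behind the claim that all arms must be explored continuously.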