A more general formulation of the linear bandit problem is considered, one that allows for dependencies over time. Specifically, it is assumed that there exists an unknown $\mathbb{R}^d$-valued stationary $\varphi$-mixing sequence of parameters $(\theta_t,~t \in \mathbb{N})$ which gives rise to the pay-offs. This instance of the problem can be viewed as a generalization of both classical linear bandits with i.i.d. noise and finite-armed restless bandits. In light of the well-known computational hardness of finding optimal policies for restless bandits, an approximation is proposed whose error is shown to be controlled by the $\varphi$-dependence between consecutive $\theta_t$. An optimistic algorithm, called LinMix-UCB, is proposed for the case where $\theta_t$ has an exponential mixing rate. The proposed algorithm is shown to incur a sub-linear regret of $\mathcal{O}\left(\sqrt{d n \,\mathrm{polylog}(n)}\right)$ with respect to an oracle that always plays a multiple of $\mathbb{E}\theta_t$. The main challenge in this setting is to ensure that the exploration-exploitation strategy is robust against long-range dependencies. The proposed method relies on Berbee's coupling lemma to carefully select near-independent samples and construct confidence ellipsoids around empirical estimates of $\mathbb{E}\theta_t$.
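The idea of selecting near-independent samples can be illustrated with a minimal sketch. It is not LinMix-UCB: it only thins a dependent sequence by keeping observations that are a gap of order $\log n$ apart and averages them to estimate $\mathbb{E}\theta_t$. The function names, the constant `c`, the logarithmic gap rule, and the toy autoregressive sequence are all assumptions made for illustration. The toy sequence is merely a stand-in for a dependent process and is not claimed to satisfy the $\varphi$-mixing assumption. No confidence ellipsoid is constructed here.

```python
import math
import random


def log_gap(n, c=1.0):
    """Hypothetical spacing of order log(n); c is an illustrative constant."""
    return max(1, math.ceil(c * math.log(n)))


def spaced_indices(n, gap):
    """Indices 0, gap, 2*gap, ... strictly below n."""
    return list(range(0, n, gap))


def empirical_mean(vectors):
    """Coordinate-wise average of a non-empty list of equal-length vectors."""
    m = len(vectors)
    d = len(vectors[0])
    return [sum(v[j] for v in vectors) / m for j in range(d)]


def toy_dependent_sequence(n, mean, rho=0.8, noise=0.1, seed=0):
    """Toy AR(1)-style sequence around `mean` (illustrative stand-in only)."""
    rng = random.Random(seed)
    theta = list(mean)
    out = []
    for _ in range(n):
        # Each coordinate is pulled back towards its mean and perturbed by noise.
        theta = [m + rho * (x - m) + rng.gauss(0.0, noise)
                 for x, m in zip(theta, mean)]
        out.append(theta)
    return out


def thinned_mean_estimate(thetas, c=1.0):
    """Average only the samples that are `gap` steps apart.

    Returns the estimate, the gap used, and the number of samples kept.
    """
    gap = log_gap(len(thetas), c)
    kept = [thetas[i] for i in spaced_indices(len(thetas), gap)]
    return empirical_mean(kept), gap, len(kept)
```

For example, `thinned_mean_estimate(toy_dependent_sequence(1000, [0.5, -0.2]))` keeps every seventh observation. The trade-off this is meant to convey is that a larger gap weakens the dependence between the retained samples but leaves fewer of them to average.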