A/B Testing and Best-arm Identification for Linear Bandits with Robustness to Non-stationarity

We investigate the fixed-budget best-arm identification (BAI) problem for linear bandits in a potentially non-stationary environment. Given a finite arm set $\mathcal{X}\subset\mathbb{R}^d$, a fixed budget $T$, and an unpredictable sequence of parameters $\left\lbrace\theta_t\right\rbrace_{t=1}^{T}$, an algorithm aims to correctly identify the best arm $x^* := \arg\max_{x\in\mathcal{X}}x^\top\sum_{t=1}^{T}\theta_t$ with as high a probability as possible. Prior work has addressed the stationary setting, where $\theta_t = \theta_1$ for all $t$, and demonstrated that the error probability decreases as $\exp(-T /\rho^*)$ for a problem-dependent constant $\rho^*$. However, in many real-world $A/B/n$ multivariate testing scenarios that motivate our work, the environment is non-stationary, and an algorithm that expects a stationary setting can easily fail. For robust identification, it is well known that if arms are chosen randomly and non-adaptively from a G-optimal design over $\mathcal{X}$ at each time, then the error probability decreases as $\exp(-T\Delta^2_{(1)}/d)$, where $\Delta_{(1)} = \min_{x \neq x^*} (x^* - x)^\top \frac{1}{T}\sum_{t=1}^T \theta_t$. As there exist environments where $\Delta_{(1)}^2/ d \ll 1/ \rho^*$, we propose a novel algorithm, $\mathsf{P1}$-$\mathsf{RAGE}$, that aims to obtain the best of both worlds: robustness to non-stationarity and fast rates of identification in benign settings. We characterize the error probability of $\mathsf{P1}$-$\mathsf{RAGE}$ and demonstrate empirically that the algorithm never performs worse than G-optimal design, yet compares favorably to the best algorithms in the stationary setting.
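
To make the robust baseline concrete, the following is a minimal Python sketch of non-adaptive sampling from a G-optimal design, followed by a recommendation of the empirically best arm. It is not $\mathsf{P1}$-$\mathsf{RAGE}$. The function names `g_optimal_design` and `recommend_best_arm`, the Frank-Wolfe solver, the inverse-propensity style estimator $\hat\theta = \frac{1}{T}\sum_t A(\lambda)^{-1} x_{I_t} r_t$, the Gaussian noise, and the toy parameter sequence are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def g_optimal_design(X, iters=1000, tol=1e-3):
    """Approximate a G-optimal design over the rows of X with Frank-Wolfe.

    Assumes the arms span R^d, so the design matrix is invertible.
    """
    n, d = X.shape
    lam = np.full(n, 1.0 / n)  # start from the uniform design
    for _ in range(iters):
        A_inv = np.linalg.inv(X.T @ (lam[:, None] * X))
        # g[i] = x_i^T A(lam)^{-1} x_i for every arm i
        g = np.einsum("ij,jk,ik->i", X, A_inv, X)
        j = int(np.argmax(g))
        # Kiefer-Wolfowitz: max_i g[i] equals d at the optimal design
        if g[j] <= d * (1.0 + tol):
            break
        # exact line-search step for the equivalent log-det objective
        step = (g[j] / d - 1.0) / (g[j] - 1.0)
        lam = (1.0 - step) * lam
        lam[j] += step
    return lam / lam.sum()


def recommend_best_arm(X, thetas, lam, rng, noise_sd=1.0):
    """Draw T arms i.i.d. from lam, then recommend argmax_x x^T theta_hat.

    thetas has shape (T, d); row t is the (unknown) parameter at time t.
    """
    T = len(thetas)
    A_inv = np.linalg.inv(X.T @ (lam[:, None] * X))
    pulls = rng.choice(len(X), size=T, p=lam)  # non-adaptive draws
    noise = noise_sd * rng.standard_normal(T)  # assumed Gaussian noise
    rewards = np.einsum("ij,ij->i", X[pulls], thetas) + noise
    # unbiased estimate of the time-averaged parameter (1/T) sum_t theta_t
    theta_hat = A_inv @ (X[pulls] * rewards[:, None]).mean(axis=0)
    return int(np.argmax(X @ theta_hat))


# Toy non-stationary instance (hypothetical): arm 1 looks best during the
# first third of the budget, but arm 0 is best for the summed parameters.
rng = np.random.default_rng(0)
X = np.eye(3)
T = 6000
thetas = np.zeros((T, 3))
thetas[: T // 3, 1] = 1.0
thetas[T // 3 :, 0] = 1.5

lam = g_optimal_design(X)
best = recommend_best_arm(X, thetas, lam, rng)
true_best = int(np.argmax(X @ thetas.sum(axis=0)))
print(lam, best, true_best)
```

Because the sampling distribution never depends on past observations, the estimate stays unbiased for the time-averaged parameter however the sequence $\theta_t$ moves. This is the robustness property the abstract attributes to G-optimal design, and it comes at the cost of the slower $\exp(-T\Delta^2_{(1)}/d)$ rate.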
