We introduce exploration via linear loss perturbations (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. With the data-dependent perturbations we propose, not present in previous PHE-type methods, EVILL is shown to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside of generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.
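To make the "few lines of code" claim concrete, here is a minimal sketch for the simplest generalised linear case: a linear model with squared loss and ridge regularisation. This setting is an assumption made here for illustration; it is not the general algorithm. The function names `evill_linear` and `phe_linear`, the Gaussian noise, and the `scale` and `lam` parameters are all hypothetical choices. The perturbation vector `w` is built from the observed features, so its covariance is proportional to the regularised Gram matrix. In this special case, minimising the linearly perturbed loss gives the same estimate as least squares on perturbed rewards, which is the reduction to PHE described above.

```python
import numpy as np


def evill_linear(X, y, z, z_reg, lam=1.0, scale=1.0):
    """Minimise 0.5*||y - X@theta||^2 + 0.5*lam*||theta||^2 + w@theta.

    Illustrative assumption: w = scale * (X.T @ z + sqrt(lam) * z_reg), so that
    for standard normal z and z_reg, Cov(w) = scale^2 * (X.T @ X + lam * I).
    """
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)  # Hessian of the regularised loss
    w = scale * (X.T @ z + np.sqrt(lam) * z_reg)  # data-dependent perturbation
    # First-order condition of the perturbed loss: V @ theta - X.T @ y + w = 0.
    return np.linalg.solve(V, X.T @ y - w)


def phe_linear(X, y, z, z_reg, lam=1.0, scale=1.0):
    """Least squares on perturbed rewards, with the ridge penalty written as
    d pseudo-observations whose zero rewards are perturbed as well."""
    d = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    y_aug = np.concatenate([y - scale * z, -scale * z_reg])
    theta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return theta


# Toy data (entirely synthetic, for demonstration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + 0.1 * rng.standard_normal(50)

# Share the same noise draws so the two estimates can be compared directly.
z, z_reg = rng.standard_normal(50), rng.standard_normal(3)
theta_evill = evill_linear(X, y, z, z_reg)
theta_phe = phe_linear(X, y, z, z_reg)

# One round of exploration: act greedily with respect to the perturbed estimate.
arms = rng.standard_normal((5, 3))
chosen_arm = int(np.argmax(arms @ theta_evill))
```

With `z` and `z_reg` drawn as standard normals, the perturbed estimate is Gaussian around the ridge estimate with covariance `scale**2 * inv(V)`, the same form that linear Thompson sampling draws from. Outside generalised linear models the two functions would no longer coincide, and only the loss-perturbation form carries over directly.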