We introduce exploration via linear loss perturbations (EVILL), a randomised exploration method for structured stochastic bandit problems that works by solving for the minimiser of a linearly perturbed regularised negative log-likelihood function. We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards. In doing so, we provide a simple and clean explanation of when and why random reward perturbations give rise to good bandit algorithms. We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods, both in theory and in practice. Moreover, we show an example outside generalised linear bandits where PHE leads to inconsistent estimates, and thus linear regret, while EVILL remains performant. Like PHE, EVILL can be implemented in just a few lines of code.
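As an illustration of the "few lines of code" claim, the following is a minimal sketch for the simplest special case, a linear bandit with a Gaussian likelihood, where the regularised negative log-likelihood is a ridge-regression loss and the perturbed minimiser has a closed form. The function name `evill_linear_gaussian`, the perturbation scale `scale`, and the specific Gaussian form of the perturbation vector `w` are assumptions made for this example and are not taken from the text above. The sketch also computes the estimate obtained by fitting to randomly perturbed rewards, to show that the two coincide in this setting.

```python
import numpy as np


def evill_linear_gaussian(X, y, lam, scale, rng):
    """Minimise a linearly perturbed ridge loss (illustrative sketch only).

    Perturbed objective (assumed form):
        0.5 * ||X @ theta - y||^2 + 0.5 * lam * ||theta||^2 - w @ theta
    with w = scale * (X.T @ z + sqrt(lam) * z_reg), z and z_reg standard normal.
    """
    n, d = X.shape
    z = rng.standard_normal(n)      # one noise draw per observed reward
    z_reg = rng.standard_normal(d)  # noise draw tied to the regulariser
    w = scale * (X.T @ z + np.sqrt(lam) * z_reg)

    V = X.T @ X + lam * np.eye(d)   # regularised design matrix

    # Setting the gradient X.T @ (X @ theta - y) + lam * theta - w to zero
    # gives the closed-form minimiser of the perturbed loss.
    theta_perturbed_loss = np.linalg.solve(V, X.T @ y + w)

    # Reward-perturbation view: ridge fit to rewards y + scale * z, with the
    # regulariser centred at scale * z_reg / sqrt(lam) instead of zero.
    theta_perturbed_rewards = np.linalg.solve(
        V, X.T @ (y + scale * z) + scale * np.sqrt(lam) * z_reg
    )
    return theta_perturbed_loss, theta_perturbed_rewards


def greedy_arm(arms, theta):
    """Pick the arm (row of `arms`) with the largest predicted reward."""
    return int(np.argmax(arms @ theta))
```

In a bandit loop, one would redraw the perturbation each round, compute the perturbed estimate from the history so far, and play `greedy_arm(arms, theta)`. Under the assumed perturbation, the estimate is distributed around the ridge solution with covariance `scale**2 * inv(V)`, which is the shape of the parameter perturbation used by linear Thompson sampling.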