We investigate the regret-minimisation problem in a multi-armed bandit setting with arbitrary corruptions. As in the classical setup, at each time step the agent receives a reward drawn independently from the distribution of the chosen arm. However, these rewards are not directly observed. Instead, for a fixed $\varepsilon\in (0,\frac{1}{2})$, the agent observes a sample from the chosen arm's distribution with probability $1-\varepsilon$, or a sample from an arbitrary corruption distribution with probability $\varepsilon$. Importantly, we impose no assumptions on these corruption distributions, which can be unbounded. In this setting, we establish a problem-dependent lower bound on regret for a given family of arm distributions. We introduce CRIMED, an asymptotically optimal algorithm that achieves the exact lower bound on regret for bandits with Gaussian distributions with known variance. Additionally, we provide a finite-sample analysis of CRIMED's regret. Notably, CRIMED can effectively handle corruptions with $\varepsilon$ approaching $\frac{1}{2}$. Furthermore, we develop a tight concentration result for medians in the presence of arbitrary corruptions, again for $\varepsilon$ approaching $\frac{1}{2}$, which may be of independent interest. We also discuss an extension of the algorithm that handles misspecification in the Gaussian model.
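The observation model described above can be illustrated with a short simulation. The following is a minimal sketch and not the CRIMED algorithm: the function names `observe` and `simulate`, the constant corruption value `1e6`, and all parameter defaults are assumptions chosen for illustration. It draws observations from a Gaussian arm under $\varepsilon$-corruption and compares the empirical mean with the empirical median.

```python
import random
import statistics


def observe(arm_mean, sigma, eps, corruption, rng):
    """Return one observation under the epsilon-corruption model.

    With probability 1 - eps the sample comes from the arm's Gaussian
    distribution; with probability eps it comes from `corruption`,
    a hypothetical callable standing in for an arbitrary distribution.
    """
    if rng.random() < eps:
        return corruption(rng)
    return rng.gauss(arm_mean, sigma)


def simulate(n=20000, arm_mean=1.0, sigma=1.0, eps=0.2, seed=0):
    """Compare the empirical mean and median of corrupted observations."""
    rng = random.Random(seed)

    # Assumed corruption for illustration: a point mass at a very large
    # value, mimicking an unbounded corruption distribution.
    def corruption(_rng):
        return 1e6

    samples = [observe(arm_mean, sigma, eps, corruption, rng) for _ in range(n)]
    return statistics.mean(samples), statistics.median(samples)


if __name__ == "__main__":
    mean_est, median_est = simulate()
    print(f"empirical mean:   {mean_est:.3f}")
    print(f"empirical median: {median_est:.3f}")
```

Under these assumptions the empirical mean is dominated by the corrupted samples, whereas the empirical median stays within a bounded distance of the true arm mean, which suggests why median-based estimation is a natural choice in this setting.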