Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk

We study the trade-off between expectation and tail risk of the regret distribution in the stochastic multi-armed bandit problem. We fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of the expected regret exactly determines the decay rate of the regret tail probability in both the worst-case and instance-dependent scenarios. A novel policy is proposed that attains the optimal regret tail probability for any regret threshold. Concretely, for any given $\alpha\in[1/2, 1)$ and $\beta\in[0, \alpha]$, our policy achieves a worst-case expected regret of $\tilde O(T^\alpha)$ (we call it $\alpha$-optimal) and an instance-dependent expected regret of $\tilde O(T^\beta)$ (we call it $\beta$-consistent), while the probability of incurring an $\tilde O(T^\delta)$ regret ($\delta\geq\alpha$ in the worst-case scenario and $\delta\geq\beta$ in the instance-dependent scenario) decays exponentially with a polynomial $T$ term. This decay rate is proved to be the best achievable. Moreover, we discover an intrinsic gap in the optimal tail rate under the instance-dependent scenario, depending on whether the time horizon $T$ is known a priori. Interestingly, in the worst-case scenario this gap disappears. Finally, we extend our proposed policy design to (1) a stochastic multi-armed bandit setting with non-stationary baseline rewards, and (2) a stochastic linear bandit setting. Our results offer insight into the trade-off between regret expectation and regret tail risk in both worst-case and instance-dependent scenarios: more sub-optimality and inconsistency leave room for a lighter-tailed risk of incurring a large regret, and knowing the planning horizon in advance can make a difference in alleviating tail risks.
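To make the two quantities in this trade-off concrete, the following minimal sketch estimates both the expected regret and the empirical tail probability $P(\text{regret} > T^\delta)$ by simulation. It is only an illustration and not the policy proposed in this work: it uses a standard UCB-style index as a stand-in baseline, and the Gaussian unit-variance rewards, the arm means, the helper names `run_ucb` and `regret_tail`, and the thresholds $T^\delta$ without constants or logarithmic factors are all assumptions made for the example.

```python
import math
import random


def run_ucb(means, horizon, rng):
    """Play a UCB-style baseline once on Gaussian arms (unit variance).

    Returns the pseudo-regret: the sum of mean gaps of the pulled arms.
    This is a stand-in baseline, not the policy described in the abstract.
    """
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once before using the index
        else:
            # empirical mean plus an exploration bonus
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret


def regret_tail(means, horizon, runs, deltas, seed=0):
    """Estimate mean regret and P(regret > horizon**delta) over independent runs."""
    rng = random.Random(seed)
    regrets = [run_ucb(means, horizon, rng) for _ in range(runs)]
    mean_regret = sum(regrets) / runs
    tail = {d: sum(r > horizon ** d for r in regrets) / runs for d in deltas}
    return mean_regret, tail


if __name__ == "__main__":
    mean_regret, tail = regret_tail([0.5, 0.0], horizon=500, runs=200,
                                    deltas=[0.5, 0.75, 1.0])
    print("estimated expected regret:", mean_regret)
    for delta, prob in tail.items():
        print(f"empirical P(regret > T^{delta}) = {prob}")
```

The expected regret summarizes average performance, while the tail estimates show how often a single run ends with a large regret; the abstract's results concern how fast that second quantity can be made to decay for a given order of the first.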