We study the optimal trade-off between expectation and tail risk of the regret distribution in the stochastic multi-armed bandit model. We fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We propose new policies that characterize the optimal regret tail probability at any regret threshold. In particular, we discover an intrinsic gap in the optimal tail rate depending on whether the time horizon $T$ is known a priori. Interestingly, this gap disappears in the purely worst-case scenario. Our results offer insights into how to design policies that balance efficiency and safety, and shed additional light on policy robustness with respect to hyper-parameters and model mis-specification. We also conduct a simulation study to validate our theoretical insights and provide practical amendments to our policies. Finally, we discuss extensions of our results to (i) general sub-exponential environments and (ii) general stochastic linear bandits. Furthermore, we find that a special case of our policy design surprisingly coincides with the one adopted in AlphaGo's Monte Carlo Tree Search. Our theory provides high-level insight into why this engineered solution is successful and should be advocated in complex decision-making environments.