This paper considers a stochastic Multi-Armed Bandit (MAB) problem with dual objectives: (i) quick identification of and commitment to the optimal arm, and (ii) reward maximization over a sequence of $T$ consecutive rounds. Each objective has been well studied on its own, as best arm identification for (i) and as regret minimization for (ii), but achieving both simultaneously remains an open problem despite its practical importance. This paper introduces \emph{Regret Optimal Best Arm Identification} (ROBAI), which aims to achieve both objectives. To solve ROBAI under both pre-determined and adaptive stopping-time requirements, we present an algorithm called EOCP and its variants, which not only achieve asymptotically optimal regret in both Gaussian and general bandits, but also commit to the optimal arm in $\mathcal{O}(\log T)$ rounds with a pre-determined stopping time and in $\mathcal{O}(\log^2 T)$ rounds with an adaptive stopping time. We further characterize lower bounds on the commitment time (equivalent to the sample complexity) of ROBAI, showing that EOCP and its variants are sample optimal with a pre-determined stopping time and almost sample optimal with an adaptive stopping time. Numerical results confirm our theoretical analysis and reveal an interesting "over-exploration" phenomenon exhibited by classic UCB algorithms: EOCP attains smaller regret even though it stops exploring much earlier than UCB, i.e., $\mathcal{O}(\log T)$ versus $\mathcal{O}(T)$, which suggests that over-exploration is unnecessary and potentially harmful to system performance.
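To make the commit-after-$\mathcal{O}(\log T)$-rounds idea concrete, the following is a minimal sketch of a generic explore-then-commit strategy with a pre-determined stopping time. It is not the EOCP algorithm, whose details are not given in this abstract. The function name `explore_then_commit`, the uniform round-robin exploration, the unit-variance Gaussian rewards, and the constant `8` in the exploration budget are all illustrative assumptions.

```python
import math
import random


def explore_then_commit(means, horizon, explore_per_arm, seed=0):
    """Generic explore-then-commit (illustrative; NOT the paper's EOCP).

    Pulls every arm `explore_per_arm` times in round-robin order, then commits
    to the arm with the highest empirical mean for the remaining rounds.
    Returns (committed_arm, commitment_time, pseudo_regret).
    """
    rng = random.Random(seed)
    k = len(means)
    best_mean = max(means)
    sums = [0.0] * k
    counts = [0] * k
    commit_time = min(horizon, k * explore_per_arm)

    regret = 0.0
    # Exploration phase: assumed unit-variance Gaussian rewards.
    for t in range(commit_time):
        arm = t % k
        sums[arm] += rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        # Pseudo-regret uses the true means, which only a simulation knows.
        regret += best_mean - means[arm]

    def empirical_mean(a):
        return sums[a] / counts[a] if counts[a] else float("-inf")

    # Commitment phase: play the empirically best arm until the horizon.
    committed = max(range(k), key=empirical_mean)
    regret += (horizon - commit_time) * (best_mean - means[committed])
    return committed, commit_time, regret


if __name__ == "__main__":
    T = 10_000
    # Hypothetical budget: a constant times log T pulls per arm.
    n = math.ceil(8 * math.log(T))
    arm, stop, reg = explore_then_commit([2.0, 0.0], T, n)
    print(f"committed to arm {arm} after {stop} rounds, pseudo-regret {reg:.1f}")
```

In this sketch the commitment time is fixed in advance at `k * explore_per_arm`, so it grows like $\log T$ when the per-arm budget does. An adaptive stopping rule would instead decide when to commit from the observed rewards.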