In this work, we extend the concept of the $p$-mean welfare objective from social choice theory (Moulin 2004) to study $p$-mean regret in stochastic multi-armed bandit problems. The $p$-mean regret, defined as the difference between the highest mean reward among the arms and the $p$-mean of the algorithm's expected per-round rewards, offers a flexible framework for evaluating bandit algorithms: by adjusting the parameter $p$, algorithm designers can trade off fairness against efficiency. Our framework encompasses both average cumulative regret and Nash regret as special cases. We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that achieves novel $p$-mean regret bounds. The algorithm proceeds in two phases: a carefully calibrated uniform exploration phase that initializes the sample means, followed by the UCB1 algorithm of Auer, Cesa-Bianchi, and Fischer (2002). Under mild assumptions, we prove that our algorithm achieves a $p$-mean regret bound of $\tilde{O}\left(\sqrt{\frac{k}{T^{\frac{1}{2|p|}}}}\right)$ for all $p \leq -1$, where $k$ denotes the number of arms and $T$ the time horizon. For $-1<p<0$, we obtain a regret bound of $\tilde{O}\left(\sqrt{\frac{k^{1.5}}{T^{\frac{1}{2}}}}\right)$. For $0< p \leq 1$, the $p$-mean regret scales as $\tilde{O}\left(\sqrt{\frac{k}{T}}\right)$, matching the previously established lower bound up to logarithmic factors (Auer et al. 1995); this follows from the fact that, for $p \leq 1$, the $p$-mean regret of any algorithm is at least its average cumulative regret. For Nash regret (the limit as $p$ approaches zero), our unified approach differs from prior work (Barman et al. 2023), which requires a dedicated Nash Confidence Bound algorithm; notably, we match its regret bound up to constant factors using our more general method.
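As a concrete illustration of the objective (a minimal sketch, not code from the paper), the $p$-mean and the resulting regret can be computed as below; the helper names `p_mean` and `p_mean_regret`, and the per-round expected rewards, are hypothetical. The $p$-mean of positive values is $M_p(x) = \left(\frac{1}{n}\sum_i x_i^p\right)^{1/p}$ for $p \neq 0$, with the geometric mean as the limit $p \to 0$; $p = 1$ recovers the arithmetic mean (average cumulative regret) and $p = 0$ the Nash regret.

```python
import math

def p_mean(values, p):
    """Power mean M_p of positive values; the limit p -> 0 is the geometric mean."""
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)

def p_mean_regret(expected_rewards, best_mean, p):
    """Difference between the optimal arm's mean reward and the p-mean of the
    per-round expected rewards collected by an algorithm."""
    return best_mean - p_mean(expected_rewards, p)

# Hypothetical per-round expected rewards of some policy; best arm mean 0.9.
rewards = [0.5, 0.7, 0.9, 0.9, 0.9]
print(p_mean_regret(rewards, 0.9, 1))    # average cumulative regret (p = 1)
print(p_mean_regret(rewards, 0.9, 0))    # Nash regret (p -> 0)
print(p_mean_regret(rewards, 0.9, -1))   # harmonic-mean regret (p = -1)
```

Since the power mean is nondecreasing in $p$, the regret grows as $p$ decreases, which is why smaller $p$ places more weight on fairness (rounds with low expected reward are penalized more heavily).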