Existing approaches to fairness in stochastic multi-armed bandits (MAB) primarily focus on exposure guarantees for individual arms. When arms are naturally grouped by one or more attributes, we propose Bi-Level Fairness, which considers two levels of fairness. At the first level, Bi-Level Fairness guarantees a certain minimum exposure to each group. To address the unbalanced allocation of pulls to individual arms within a group, we consider meritocratic fairness at the second level, which ensures that each arm is pulled according to its merit within the group. Our work shows that a UCB-based algorithm can be adapted to achieve Bi-Level Fairness by (i) providing anytime Group Exposure Fairness guarantees and (ii) ensuring individual-level Meritocratic Fairness within each group. We first show that the regret bound can be decomposed into two components: (a) regret due to anytime group exposure fairness and (b) regret due to meritocratic fairness within each group. Our proposed algorithm, BF-UCB, balances these two regrets optimally to achieve an upper bound of $O(\sqrt{T})$ on regret, where $T$ is the stopping time. Using simulated experiments, we further show that BF-UCB achieves sub-linear regret, provides better group and individual exposure guarantees than existing algorithms, and does not incur a significant drop in reward relative to the UCB algorithm, which imposes no fairness constraint.
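To make the two levels concrete, the following is a minimal sketch of a bandit loop with a group-level exposure floor and merit-proportional pulls within the selected group. It is not the BF-UCB algorithm itself, whose details the abstract does not give. Everything named here is an assumption made for illustration: the function `bilevel_fair_ucb_sketch`, the per-group minimum shares `min_share`, Bernoulli rewards, the largest-deficit rule for enforcing the floor, and the choice of an arm's clipped UCB index as its merit.

```python
import math
import random


def bilevel_fair_ucb_sketch(means, groups, min_share, horizon, seed=0):
    """Illustrative two-level fair bandit loop (hypothetical; not BF-UCB).

    means:     true Bernoulli means per arm (used only to simulate rewards)
    groups:    group label per arm
    min_share: assumed minimum fraction of pulls per group (sums to <= 1)
    """
    rng = random.Random(seed)
    n_arms = len(means)
    pulls = [0] * n_arms
    reward_sums = [0.0] * n_arms
    group_ids = list(dict.fromkeys(groups))  # stable, duplicate-free order
    members = {g: [i for i in range(n_arms) if groups[i] == g] for g in group_ids}
    group_pulls = {g: 0 for g in group_ids}

    def ucb(i, t):
        # Optimistic index clipped to [0, 1]; unpulled arms get the maximum.
        if pulls[i] == 0:
            return 1.0
        bonus = math.sqrt(2.0 * math.log(t) / pulls[i])
        return min(1.0, reward_sums[i] / pulls[i] + bonus)

    for t in range(1, horizon + 1):
        # Level 1: serve the group furthest below its target share, if any.
        deficits = {g: min_share[g] * t - group_pulls[g] for g in group_ids}
        neediest = max(group_ids, key=lambda g: deficits[g])
        if deficits[neediest] > 0:
            chosen = neediest
        else:
            # No group is behind: follow the arm with the highest UCB index.
            best_arm = max(range(n_arms), key=lambda i: ucb(i, t))
            chosen = groups[best_arm]

        # Level 2: sample within the group in proportion to (assumed) merit.
        weights = [ucb(i, t) for i in members[chosen]]
        arm = rng.choices(members[chosen], weights=weights, k=1)[0]

        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward
        group_pulls[chosen] += 1

    return pulls, group_pulls


if __name__ == "__main__":
    arm_pulls, per_group = bilevel_fair_ucb_sketch(
        means=[0.9, 0.8, 0.2, 0.1],
        groups=["a", "a", "b", "b"],
        min_share={"a": 0.3, "b": 0.3},
        horizon=2000,
    )
    print(arm_pulls, per_group)
```

In this sketch the low-reward group still receives roughly its assumed 30% share of pulls at every round, and within each group no arm is starved, because pulls are randomized by merit rather than given entirely to the empirically best arm. The regret this incurs relative to unconstrained UCB corresponds informally to the two components (a) and (b) described above.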