Classic no-regret online prediction algorithms, including variants of the Upper Confidence Bound ($\texttt{UCB}$) algorithm, $\texttt{Hedge}$, and $\texttt{EXP3}$, are unfair by design. The unfairness stems from their very objective: among $N$ arms, they play the most rewarding arm as many times as possible and ignore the less rewarding ones. In this paper, we consider a fair prediction problem in the stochastic setting with hard lower bounds on the rate of accrual of rewards for a set of arms. We study the problem in both the full-information and bandit feedback settings. Using queueing-theoretic techniques in conjunction with adversarial learning, we propose a new online prediction policy called $\texttt{BanditQ}$ that attains the target reward rates while incurring a regret and a target rate violation penalty of $O(T^{\frac{3}{4}})$. In the full-information setting, the regret bound can be further improved to $O(\sqrt{T})$ when considering the average regret over the entire horizon of length $T$. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard MAB problem with a carefully defined sequence of rewards. The design and analysis of the $\texttt{BanditQ}$ policy involve a novel use of the potential function method in conjunction with scale-free second-order regret bounds and a new self-bounding inequality for the reward gradients, which are of independent interest.