Classic online prediction algorithms, such as Hedge, are unfair by design: to achieve sublinear regret, they play the most rewarding arm as often as possible and neglect the sub-optimal arms. In this paper, we consider a fair online prediction problem in the adversarial setting, with hard lower bounds on the rate at which every arm accrues reward. By combining elementary queueing theory with online learning, we propose a new online prediction policy, called BanditQ, that meets the target rate constraints while incurring a regret of $O(T^{3/4})$ in the full-information setting. The design and analysis of BanditQ involve a novel use of the potential function method and are of independent interest.
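To make the unfairness of Hedge concrete, the following is a minimal sketch of the standard exponential-weights (Hedge) update in the full-information setting. It is not BanditQ, whose details are not given in this abstract. The reward table, the horizon `T`, the learning rate `eta`, and the helper name `hedge` are all illustrative assumptions.

```python
import math


def hedge(rewards, eta):
    """Run Hedge on a list of per-round reward vectors with entries in [0, 1].

    Returns the probability vector played in each round.
    """
    n_arms = len(rewards[0])
    cum = [0.0] * n_arms  # cumulative reward observed so far, per arm
    played = []
    for r_t in rewards:
        # Exponential weights; subtracting the max keeps exp() from overflowing.
        m = max(cum)
        weights = [math.exp(eta * (c - m)) for c in cum]
        total = sum(weights)
        p_t = [w / total for w in weights]
        played.append(p_t)
        # Full information: the reward of every arm is revealed after playing.
        cum = [c + r for c, r in zip(cum, r_t)]
    return played


# Hypothetical instance: three arms with fixed rewards 1.0, 0.5 and 0.0.
T = 1000
rewards = [[1.0, 0.5, 0.0] for _ in range(T)]
eta = math.sqrt(8 * math.log(3) / T)  # a common tuning, assumed here

played = hedge(rewards, eta)

# Average probability with which each arm was selected over the horizon.
avg_prob = [sum(p[i] for p in played) / T for i in range(3)]
# Average expected reward accrued through each arm per round.
accrual_rate = [
    sum(p[i] * r[i] for p, r in zip(played, rewards)) / T for i in range(3)
]

print("average selection probability:", avg_prob)
print("per-arm reward accrual rate:  ", accrual_rate)
```

In this toy instance nearly all of the probability mass ends up on the first arm, so the other arms accrue reward at a rate close to zero. A hard lower bound on every arm's accrual rate, as considered above, rules out exactly this behavior.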