Modern systems, such as digital platforms and service systems, increasingly rely on contextual bandits for online decision-making; however, their deployment can inadvertently create unfair exposure among arms, undermining long-term platform sustainability and supplier trust. This paper studies the contextual bandit problem under a uniform $(1-\delta)$-fairness constraint and addresses its unique vulnerabilities to strategic manipulation. The fairness constraint ensures that preferential treatment is strictly justified by an arm's actual reward across all contexts and time horizons, using uniformity to prevent statistical loopholes. We develop novel algorithms that achieve (nearly) minimax-optimal regret for both linear and smooth reward functions while maintaining strong $(1-\tilde{O}(1/T))$-fairness guarantees, and we further characterize the theoretically inherent yet asymptotically marginal "price of fairness". However, we reveal that such merit-based fairness is uniquely susceptible to signal manipulation. We show that an adversary with a minimal $\tilde{O}(1)$ budget can not only degrade overall performance, as in traditional attacks, but also selectively induce insidious fairness-specific failures while leaving conspicuous regret measures largely unaffected. To counter this, we design robust variants incorporating corruption-adaptive exploration and error-compensated thresholding. Our approach yields the first minimax-optimal regret bounds under a $C$-budgeted attack while preserving $(1-\tilde{O}(1/T))$-fairness. Numerical experiments and a real-world case study demonstrate that our algorithms sustain both fairness and efficiency.