We obtain essentially tight upper bounds for a strengthened notion of regret in the stochastic linear bandits framework. The strengthening -- referred to as Nash regret -- is defined as the difference between the (a priori unknown) optimum and the geometric mean of expected rewards accumulated by the linear bandit algorithm. Since the geometric mean corresponds to the well-studied Nash social welfare (NSW) function, this formulation quantifies the performance of a bandit algorithm as the collective welfare it generates across rounds. NSW is known to satisfy fairness axioms and, hence, an upper bound on Nash regret provides a principled fairness guarantee. We consider the stochastic linear bandits problem over a horizon of $T$ rounds and with a set of arms ${X}$ in ambient dimension $d$. Furthermore, we focus on settings in which the stochastic reward -- associated with each arm in ${X}$ -- is a non-negative, $\nu$-sub-Poisson random variable. For this setting, we develop an algorithm that achieves a Nash regret of $O\left( \sqrt{\frac{d\nu}{T}} \log( T |X|)\right)$. In addition, addressing linear bandit instances in which the set of arms ${X}$ is not necessarily finite, we obtain a Nash regret upper bound of $O\left( \frac{d^{\frac{5}{4}}\nu^{\frac{1}{2}}}{\sqrt{T}} \log(T)\right)$. Since bounded random variables are sub-Poisson, these results hold for bounded, positive rewards. Our linear bandit algorithm is built upon the successive elimination method with novel technical insights, including tailored concentration bounds and the use of sampling via the John ellipsoid in conjunction with the Kiefer-Wolfowitz optimal design.
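To make the definition of Nash regret concrete, the following is a minimal sketch and not the algorithm developed in this work. It assumes we are handed the optimal expected reward and the per-round expected rewards of some policy, and it computes the optimum minus the geometric mean of those rewards, alongside the usual average regret for comparison. The function names `nash_regret` and `average_regret` and their arguments are hypothetical, introduced only for illustration.

```python
import math


def nash_regret(optimal_mean, expected_rewards):
    """Optimum minus the geometric mean of per-round expected rewards.

    Illustrative helper (hypothetical name): `expected_rewards` holds the
    expected reward of the arm pulled in each of the T rounds and is
    assumed to be non-negative, matching the setting in the abstract.
    """
    if not expected_rewards:
        raise ValueError("need at least one round")
    if any(r < 0 for r in expected_rewards):
        raise ValueError("expected rewards must be non-negative")
    # A single zero-reward round drives the geometric mean to zero.
    if any(r == 0 for r in expected_rewards):
        return optimal_mean
    # Work in log space so a long horizon does not underflow the product.
    mean_log = sum(math.log(r) for r in expected_rewards) / len(expected_rewards)
    return optimal_mean - math.exp(mean_log)


def average_regret(optimal_mean, expected_rewards):
    """Optimum minus the arithmetic mean of per-round expected rewards."""
    return optimal_mean - sum(expected_rewards) / len(expected_rewards)


if __name__ == "__main__":
    rewards = [0.25, 1.0]  # toy two-round example, optimum is 1.0
    print(nash_regret(1.0, rewards), average_regret(1.0, rewards))
```

By the AM-GM inequality the geometric mean never exceeds the arithmetic mean, so the Nash regret returned here is always at least the average regret. This is the sense in which Nash regret is a strengthening, and a single round with near-zero expected reward is penalized heavily.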