How Does Variance Shape the Regret in Contextual Bandits?

We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax bounds, we show that the eluder dimension $d_\text{elu}$, a complexity measure of the function class, plays a crucial role in variance-dependent bounds. We consider two types of adversary: (1) Weak adversary: The adversary sets the reward variance before observing the learner's action. In this setting, we prove that a regret of $\Omega(\sqrt{\min\{A,d_\text{elu}\}\Lambda}+d_\text{elu})$ is unavoidable when $d_{\text{elu}}\leq\sqrt{AT}$, where $A$ is the number of actions, $T$ is the total number of rounds, and $\Lambda$ is the total variance over $T$ rounds. For the $A\leq d_\text{elu}$ regime, we derive a nearly matching upper bound $\tilde{O}(\sqrt{A\Lambda}+d_\text{elu})$ for the special case where the variance is revealed at the beginning of each round. (2) Strong adversary: The adversary sets the reward variance after observing the learner's action. We show that a regret of $\Omega(\sqrt{d_\text{elu}\Lambda}+d_\text{elu})$ is unavoidable when $\sqrt{d_\text{elu}\Lambda}+d_\text{elu}\leq\sqrt{AT}$. In this setting, we provide an upper bound of order $\tilde{O}(d_\text{elu}\sqrt{\Lambda}+d_\text{elu})$. Furthermore, we examine the setting where the function class additionally provides distributional information about the reward, as studied by Wang et al. (2024). We demonstrate that the regret bound $\tilde{O}(\sqrt{d_\text{elu}\Lambda}+d_\text{elu})$ established in their work is unimprovable when $\sqrt{d_{\text{elu}}\Lambda}+d_\text{elu}\leq\sqrt{AT}$. However, with a slightly different definition of the total variance and with the assumption that the reward follows a Gaussian distribution, one can achieve a regret of $\tilde{O}(\sqrt{A\Lambda}+d_\text{elu})$.
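To get a feel for how these rates compare, the following minimal sketch evaluates the leading terms of the stated bounds for one illustrative parameter choice. It drops all constants and logarithmic factors, so the outputs only indicate orders of magnitude. The function names and the values of $A$, $d_\text{elu}$, $T$, and $\Lambda$ are assumptions made for illustration and do not come from the paper.

```python
import math


def weak_lower(A, d_elu, Lam):
    """Leading term of the weak-adversary lower bound: sqrt(min{A, d_elu} * Lam) + d_elu."""
    return math.sqrt(min(A, d_elu) * Lam) + d_elu


def weak_upper(A, d_elu, Lam):
    """Leading term of the weak-adversary upper bound (variance revealed each round)."""
    return math.sqrt(A * Lam) + d_elu


def strong_lower(d_elu, Lam):
    """Leading term of the strong-adversary lower bound: sqrt(d_elu * Lam) + d_elu."""
    return math.sqrt(d_elu * Lam) + d_elu


def strong_upper(d_elu, Lam):
    """Leading term of the strong-adversary upper bound: d_elu * sqrt(Lam) + d_elu."""
    return d_elu * math.sqrt(Lam) + d_elu


# Illustrative (hypothetical) parameters: few actions, low total variance.
A, d_elu, T, Lam = 4, 16, 10_000, 100
threshold = math.sqrt(A * T)  # the sqrt(AT) quantity used in the regime conditions

# Regime conditions stated for the lower bounds.
assert d_elu <= threshold
assert strong_lower(d_elu, Lam) <= threshold

print("sqrt(AT)      :", threshold)
print("weak lower    :", weak_lower(A, d_elu, Lam))
print("weak upper    :", weak_upper(A, d_elu, Lam))
print("strong lower  :", strong_lower(d_elu, Lam))
print("strong upper  :", strong_upper(d_elu, Lam))
```

With these values, $A\leq d_\text{elu}$, so the weak-adversary lower and upper terms coincide, while the strong-adversary upper and lower terms differ by roughly a $\sqrt{d_\text{elu}}$ factor in the variance-dependent part.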
