Variance-Dependent Regret Lower Bounds for Contextual Bandits

Variance-dependent regret bounds for linear contextual bandits, which improve the classical $\tilde{O}(d\sqrt{K})$ regret bound to $\tilde{O}(d\sqrt{\sum_{k=1}^K\sigma_k^2})$, where $d$ is the context dimension, $K$ is the number of rounds, and $\sigma^2_k$ is the noise variance in round $k$, have been widely studied in recent years. However, most existing works focus on regret upper bounds rather than lower bounds. To our knowledge, the only lower bound is due to Jia et al. (2024), who proved that for any eluder dimension $d_{\textbf{elu}}$ and total variance budget $\Lambda$, there exists an instance with $\sum_{k=1}^K\sigma_k^2\leq \Lambda$ on which any algorithm incurs $\Omega(\sqrt{d_{\textbf{elu}}\Lambda})$ regret. However, this lower bound leaves a $\sqrt{d}$ gap relative to existing upper bounds. Moreover, it only considers a fixed total variance budget $\Lambda$ and does not apply to a general variance sequence $\{\sigma_1^2,\ldots,\sigma_K^2\}$. In this paper, to overcome these limitations of Jia et al. (2024), we consider general variance sequences under two settings. For a prefixed sequence, where the entire variance sequence is revealed to the learner at the beginning of the learning process, we establish a variance-dependent lower bound of $\Omega(d \sqrt{\sum_{k=1}^K\sigma_k^2 }/\log K)$ for linear contextual bandits. For an adaptive sequence, where an adversary can generate the variance $\sigma_k^2$ in each round $k$ based on historical observations, we show that when the adversary must generate $\sigma_k^2$ before observing the decision set $\mathcal{D}_k$, a similar lower bound of $\Omega(d\sqrt{ \sum_{k=1}^K\sigma_k^2} /\log^6(dK))$ holds. In both settings, our results match the upper bounds of the SAVE algorithm (Zhao et al., 2023) up to logarithmic factors.
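To make the rates above concrete, the following minimal sketch evaluates their leading-order scalings for a given variance sequence. It is only an illustration: all hidden constants are set to 1, the $\tilde{O}$ logarithmic factors of the upper bounds are dropped, and the function name `bound_scalings` and its dictionary keys are hypothetical. For the bound of Jia et al. (2024) the sketch assumes $d_{\textbf{elu}} = d$ and $\Lambda = \sum_{k=1}^K\sigma_k^2$, which is how the $\sqrt{d}$ gap can be read off.

```python
import math


def bound_scalings(d, variances):
    """Leading-order scalings of the bounds, with all constants set to 1.

    Assumptions (not from the paper): d_elu is taken equal to d, Lambda is
    taken equal to the total variance, and len(variances) must exceed 1 so
    that log K is positive.
    """
    K = len(variances)
    total_var = math.fsum(variances)  # sum_k sigma_k^2
    return {
        # classical rate d * sqrt(K)
        "classical_upper": d * math.sqrt(K),
        # variance-dependent upper-bound rate d * sqrt(sum_k sigma_k^2)
        "variance_upper": d * math.sqrt(total_var),
        # earlier lower bound sqrt(d_elu * Lambda), with d_elu = d
        "prior_lower": math.sqrt(d * total_var),
        # prefixed-sequence lower bound d * sqrt(sum_k sigma_k^2) / log K
        "prefixed_lower": d * math.sqrt(total_var) / math.log(K),
    }


# Example: d = 16, K = 10000 rounds, constant variance sigma_k^2 = 0.01.
rates = bound_scalings(16, [0.01] * 10_000)
gap = rates["variance_upper"] / rates["prior_lower"]  # equals sqrt(d) here
```

In this example the variance-dependent rate is a factor of ten smaller than the classical one, the ratio `gap` equals $\sqrt{d}$, and the prefixed-sequence lower bound differs from the variance-dependent upper-bound rate only by the $\log K$ factor.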
