By leveraging the representation power of deep neural networks, neural upper confidence bound (UCB) algorithms have shown success in contextual bandits. To further balance exploration and exploitation, we propose Neural-$\sigma^2$-LinearUCB, a variance-aware algorithm that utilizes $\sigma^2_t$, an upper bound on the reward noise variance at round $t$, to improve the uncertainty quantification of the UCB and thereby the regret performance. We provide an oracle version of our algorithm, characterized by an oracle variance upper bound $\sigma^2_t$, and a practical version with a novel estimator for this variance bound. Theoretically, we give rigorous regret analyses for both versions and prove that our oracle algorithm achieves a better regret guarantee than other neural-UCB algorithms in the neural contextual bandit setting. Empirically, our practical method matches the computational efficiency of state-of-the-art techniques while outperforming them, with better calibration and lower regret across multiple standard settings, including the synthetic, UCI, MNIST, and CIFAR-10 datasets.