We introduce the first variance-aware algorithms for contextual dueling bandits that leverage shallow exploration strategies with neural networks for nonlinear utility approximation. A key theoretical challenge is the absence of a closed-form estimator, which has forced prior work to require an extremely large network width $m$ (i.e., $m = \widetilde{\Omega}(T^{14})$). We address this constraint with a novel analytical approach that combines iterative self-improvement with spectral analysis. Our analysis reduces the network width requirement to $m = \widetilde{\Omega}(T^{6})$ and shows that our algorithms achieve a sublinear regret of $\widetilde{\mathcal{O}}\left(d\sqrt{\sum_{t=1}^{T} \sigma_t^2} + \sqrt{dT}\right)$ under both the UCB and TS frameworks. Empirical results show that the proposed algorithms are computationally efficient, exhibit sublinear regret in practical settings, and achieve state-of-the-art performance on both synthetic and real-world tasks.