In many bandit problems, the maximal reward achievable by a policy is unknown in advance. We consider the problem of estimating the optimal policy value in the sublinear data regime, before the optimal policy is even learnable. We refer to this as $V^*$ estimation. It was recently shown that fast $V^*$ estimation is possible, but only in disjoint linear bandits with Gaussian covariates. Whether this is possible for more realistic context distributions has remained an open and important question for tasks such as model selection. In this paper, we first provide lower bounds showing that this general problem is hard. However, under stronger assumptions, we give an algorithm and analysis proving that $\widetilde{\mathcal{O}}(\sqrt{d})$ sublinear estimation of $V^*$ is indeed information-theoretically possible, where $d$ is the dimension. We then present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$; this bound holds for general distributions and is tight when the context distribution is Gaussian. We prove that our algorithm requires only $\widetilde{\mathcal{O}}(\sqrt{d})$ samples to estimate the upper bound. We use this upper bound and the estimator to obtain novel and improved guarantees for several applications in bandit model selection and testing for treatment effects.
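For concreteness, here is a small worked instance of the quantity being estimated. The notation is an assumption made for illustration, since the abstract does not fix one: a disjoint linear bandit with two arms, unknown parameters $\theta_1, \theta_2 \in \mathbb{R}^d$, expected reward $\langle x, \theta_a \rangle$ for arm $a$ in context $x$, and Gaussian contexts $x \sim \mathcal{N}(0, I_d)$. Under these assumptions the optimal policy value has a closed form:

$$
V^* = \mathbb{E}_x\big[\max\{\langle x, \theta_1\rangle, \langle x, \theta_2\rangle\}\big]
    = \mathbb{E}_x\big[\langle x, \theta_2\rangle\big] + \mathbb{E}_x\big[\langle x, \theta_1 - \theta_2\rangle_+\big]
    = \frac{\|\theta_1 - \theta_2\|_2}{\sqrt{2\pi}},
$$

where the last step uses $\langle x, \theta_1 - \theta_2\rangle \sim \mathcal{N}(0, \|\theta_1 - \theta_2\|_2^2)$ and $\mathbb{E}[Z_+] = \sigma/\sqrt{2\pi}$ for $Z \sim \mathcal{N}(0, \sigma^2)$. In this special case $V^*$ depends on the parameters only through a single scalar, which gives some intuition for why estimating $V^*$ need not require learning $\theta_1$ and $\theta_2$ themselves.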