Nonparametric contextual bandits are an important model of sequential decision making. Under the $\alpha$-Tsybakov margin condition, existing research has established a regret bound of $\tilde{O}\left(T^{1-\frac{\alpha+1}{d+2}}\right)$ for bounded context supports. However, the optimal regret with unbounded contexts has not been analyzed. The challenge of solving contextual bandit problems with unbounded support is to achieve the exploration-exploitation tradeoff and the bias-variance tradeoff simultaneously. In this paper, we solve the nonparametric contextual bandit problem with unbounded contexts. We propose two nearest neighbor methods combined with UCB exploration. The first method uses a fixed number of neighbors $k$. Our analysis shows that this method achieves minimax optimal regret under a weak margin condition and relatively light-tailed context distributions. The second method uses an adaptive $k$. With a proper data-driven selection of $k$, this method achieves an expected regret of $\tilde{O}\left(T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}+T^{1-\beta}\right)$, in which $\beta$ is a parameter describing the tail strength. This bound matches the minimax lower bound up to logarithmic factors, indicating that the second method is approximately optimal.
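To make the general idea concrete, the following is a minimal illustrative sketch of a nearest-neighbor arm-selection rule with a UCB-style bonus. It is not the paper's algorithm: the function name `knn_ucb_choose`, the bonus form $c\sqrt{\log t / k}$, and the fallback of exploring an arm with fewer than $k$ observations are all simplifying assumptions made for illustration.

```python
import numpy as np

def knn_ucb_choose(history, x, n_arms, k=5, c=1.0):
    """Illustrative k-nearest-neighbor UCB step (not the paper's exact method).

    history: list of (context, arm, reward) tuples observed so far.
    x: current context (numpy array).
    Returns the index of the arm to pull.
    """
    scores = []
    for a in range(n_arms):
        # Collect past observations of this arm.
        pts = [(ctx, r) for ctx, arm, r in history if arm == a]
        if len(pts) < k:
            # Assumed fallback: explore any arm with too few samples.
            return a
        # Estimate the reward at x from the k nearest past contexts.
        nearest = sorted(pts, key=lambda p: np.linalg.norm(p[0] - x))[:k]
        mean = np.mean([r for _, r in nearest])
        # Assumed exploration bonus shrinking in k.
        bonus = c * np.sqrt(np.log(len(history) + 1) / k)
        scores.append(mean + bonus)
    return int(np.argmax(scores))
```

With a fixed $k$ this corresponds loosely to the first method; the second method would instead choose $k$ adaptively from the data at each round.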