We introduce a novel linear bandit problem with partially observable features, which leads to partial reward information and spurious estimates. Without properly handling the latent part, the regret can grow linearly in the decision horizon $T$, since the latent features' influence on rewards is unknown. To tackle this, we propose a novel analysis that handles the latent features and an algorithm that achieves sublinear regret. The core of our algorithm involves (i) augmenting basis vectors orthogonal to the observed feature space, and (ii) introducing an efficient doubly robust estimator. Our approach achieves a regret bound of $\tilde{O}(\sqrt{(d + d_h)T})$, where $d$ is the dimension of the observed features and $d_h$ is the unknown dimension of the subspace of the unobserved features. Notably, our algorithm requires no prior knowledge of the unobserved feature space, which may expand as more features become hidden. Numerical experiments confirm that our algorithm outperforms both non-contextual multi-armed bandit algorithms and linear bandit algorithms that rely solely on observed features.
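The first step of the algorithm, augmenting basis vectors orthogonal to the observed feature space, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `augment_orthogonal_basis` and the use of an SVD to extract the orthogonal complement are our assumptions.

```python
import numpy as np

def augment_orthogonal_basis(X: np.ndarray) -> np.ndarray:
    """Return [X | B], where the columns of B form an orthonormal basis
    of the subspace orthogonal to the column span of X.

    Hypothetical sketch: obtains the orthogonal complement from the
    left singular vectors beyond rank(X).
    """
    U, s, _ = np.linalg.svd(X, full_matrices=True)
    rank = int(np.sum(s > 1e-10))      # numerical rank of X
    B = U[:, rank:]                    # orthonormal basis of the complement
    return np.hstack([X, B])

# Example: 3 arms, 2 observed feature dimensions in ambient space R^3.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
aug = augment_orthogonal_basis(X)      # shape (3, 3), full column rank
```

The augmented matrix spans the full ambient space, so a regression on it can absorb the reward contribution of unobserved features even though their own coordinates are never revealed.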