Generalized linear bandits have been extensively studied owing to their broad applicability to real-world online decision-making problems. However, these methods typically assume that the expected reward function (the link function) is known to the learner, an assumption that is often unrealistic in practice. Misspecification of this link function can cause all existing algorithms to fail. In this work, we address this critical limitation by introducing a new problem: generalized linear bandits with unknown reward functions, also known as single index bandits. We first consider the case where the unknown reward function is monotonically increasing, and propose two novel and efficient algorithms, STOR and ESTOR, that achieve favorable regret bounds under standard assumptions. Notably, ESTOR attains the nearly optimal regret bound $\tilde{O}_T(\sqrt{T})$ in terms of the time horizon $T$. We then extend our methods to the high-dimensional sparse setting and show that the same regret rate can be attained with a dependence on the sparsity index. Next, we introduce GSTOR, an algorithm that is agnostic to general reward functions, and establish regret bounds for it under a Gaussian design assumption. Finally, we validate the efficiency and effectiveness of our algorithms through experiments on both synthetic and real-world datasets.