Linear bandits have long been a central topic in online learning, with applications ranging from recommendation systems to adaptive clinical trials. Their general learnability has been established when the objective is to minimise the inner product between a cost parameter and the decision variable. While highly general, this reliance on an inner-product structure belies the name of \emph{linear} bandits and fails to account for problems such as Optimal Transport. Using the Kantorovich formulation of Optimal Transport as an example, we show that an inner-product structure is \emph{not} necessary for efficient learning in linear bandits. We propose a refinement of the classical OFUL algorithm that embeds the action set into a Hilbertian subspace, where confidence sets can be built via least-squares estimation; actions are then constrained to this subspace by penalising optimism. The analysis is completed by leveraging convergence results from penalised (entropic) transport to the Kantorovich problem. Up to this approximation term, the resulting algorithm achieves the same trajectorial regret upper bounds as OFUL, which we convert into worst-case regret guarantees using functional regression techniques. Its regret interpolates between $\tilde{\mathcal O}(\sqrt{T})$ and ${\mathcal O}(T)$, depending on the regularity of the cost function, and recovers the parametric rate $\tilde{\mathcal O}(\sqrt{dT})$ in finite-dimensional settings.
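For context, a minimal sketch of the two transport objectives referenced above. These are the standard Kantorovich and entropically penalised formulations, stated in generic notation (marginals $\mu,\nu$, cost $c$, penalty level $\varepsilon$) rather than in this paper's specific setup:
\begin{align*}
  % Kantorovich formulation: linear in the coupling pi, but the feasible
  % set Pi(mu, nu) carries no ambient inner-product structure.
  \mathrm{OT}(\mu,\nu) &= \min_{\pi \in \Pi(\mu,\nu)} \int c \,\mathrm{d}\pi, \\
  % Entropic penalisation: a strongly convex surrogate whose value is
  % known to converge to OT(mu, nu) as epsilon tends to 0.
  \mathrm{OT}_\varepsilon(\mu,\nu) &= \min_{\pi \in \Pi(\mu,\nu)} \int c \,\mathrm{d}\pi
    + \varepsilon\,\mathrm{KL}\!\left(\pi \,\middle\|\, \mu \otimes \nu\right).
\end{align*}
The gap $\mathrm{OT}_\varepsilon - \mathrm{OT}$, which vanishes as $\varepsilon \to 0$, is the approximation term up to which the regret guarantees are stated.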