Modern recommendation systems increasingly rely on dynamically routing diverse queries to multiple embedding models. Despite its practical significance, this problem remains poorly understood under realistic conditions such as adversarial queries, bandit feedback, and limited observability of the models. We formalize embedding model routing as an adversarial contextual linear bandit with low-rank experts, where contexts are queries, actions are items, and experts are the embedding models operating on low-rank latent representation spaces. First, we establish that standard regret notions suffer from either structural misspecification or statistical intractability, and we identify a log-quadratic policy class that is expressive enough to capture query-dependent model routing, yet structured enough to allow efficient online learning. Second, we propose a policy gradient algorithm called Hypentropy Policy Gradient (HPG). It provably adapts to the unknown low-rank structure under incomplete information and attains $\tilde{\mathcal O}(s\sqrt{M T})$ linearized policy regret, where $s$, $M$, and $T$ are the intrinsic rank of the experts, the number of models, and the number of rounds, respectively, thus avoiding a curse of dimensionality. Finally, we provide a computationally efficient and parameter-free implementation of HPG.