Explosive demand for LLMs often causes user queries to accumulate in server queues, requiring efficient routing (query-LLM matching) and scheduling (query prioritization) mechanisms. Several online algorithms have been deployed, but they overlook two key challenges inherent to conversational LLM services: (1) unsatisfied users may retry queries, increasing the server backlog, and (2) requests for ``explicit'' feedback, such as ratings, degrade the user experience. In this paper, we develop a joint routing and scheduling algorithm that leverages ``implicit'' feedback inferred from user retrial behavior. Our key idea is to propose and study the framework of contextual queueing bandits with multinomial logit feedback (CQB-MNL), which models query retrials as well as context-based learning of user preferences over LLMs. Our algorithm, anytime CQB (ACQB), combines Thompson sampling with forced exploration at a decaying rate, achieving efficient learning while maintaining queue stability. We show that ACQB simultaneously achieves a cumulative routing regret of $\widetilde{\mathcal{O}}(\sqrt{t})$ and a queue length regret of $\widetilde{\mathcal{O}}(t^{-1/4})$ for all large $t$. For experiments, we refine query embeddings via contrastive learning and adopt a disjoint parameter model to learn LLM-specific parameters. Experiments on the SPROUT, EmbedLLM, and RouterBench datasets confirm that our algorithms consistently outperform baselines.
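The routing step described above, Thompson sampling combined with forced exploration at a decaying rate, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function name `acqb_route`, the Gaussian posterior `(mu, Sigma)` over a shared preference parameter, and the specific decay schedule $\varepsilon_t \propto 1/\sqrt{t}$ are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def acqb_route(t, contexts, mu, Sigma, explore_scale=1.0):
    """Choose an LLM for the query arriving at round t.

    contexts : (K, d) array, one context/feature vector per candidate LLM.
    mu, Sigma: mean and covariance of a (hypothetical) Gaussian posterior
               over the d-dimensional preference parameter.

    With probability eps_t (decaying forced exploration) a uniformly
    random LLM is chosen; otherwise a parameter vector is Thompson-sampled
    from the posterior and the query is routed greedily.
    """
    K, _ = contexts.shape
    eps_t = min(1.0, explore_scale / np.sqrt(t + 1))  # decaying exploration rate
    if rng.random() < eps_t:
        return int(rng.integers(K))                   # forced exploration
    theta = rng.multivariate_normal(mu, Sigma)        # Thompson sample
    return int(np.argmax(contexts @ theta))           # greedy routing
```

The decaying schedule reflects the trade-off in the abstract: early forced exploration keeps every LLM's parameters identifiable (supporting the $\widetilde{\mathcal{O}}(\sqrt{t})$ routing regret), while its decay limits the backlog cost of exploratory routing, which is what the queue length regret bound controls.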