Large Language Models (LLMs) are increasingly deployed in edge-cloud inference systems to handle diverse user tasks with heterogeneous accuracy, latency, and cost profiles. Selecting the appropriate LLM for each incoming task is critical for ensuring service quality and efficient resource utilization. However, model heterogeneity, stochastic and unknown performance characteristics, and time-varying task demands make static selection strategies inadequate. Real-world deployments often impose hard resource budgets such as monetary expenditure limits, along with soft service-level requirements such as latency guarantees. These constraints introduce additional challenges for online decision-making. We formulate this problem as a constrained stochastic bandit learning task, where the learner sequentially selects models under both packing-type (hard) and covering-type (soft) constraints, while adapting to time-varying task demand. The learner operates without access to the underlying reward, cost, or latency distributions and must rely on partial feedback. We develop a novel online learning algorithm that leverages confidence-bound estimates and demand predictions to balance reward maximization with long-term constraint satisfaction. We provide theoretical guarantees showing sublinear regret and sublinear covering constraint violations compared to an offline benchmark with full information. Experimental results on synthetic workloads demonstrate the effectiveness and robustness of our approach in dynamic, resource-constrained environments.
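To make the setting concrete, the following is a minimal sketch of a confidence-bound model selector with one hard budget and one soft latency target. It is not the algorithm proposed in the paper: it swaps in a generic primal-dual (Lagrangian) UCB rule, omits demand prediction and time-varying demand, and assumes Bernoulli feedback with rewards, costs, and latencies normalized to [0, 1]. The names `run_sketch`, `latency_target`, and `eta` are hypothetical.

```python
import math
import random


def run_sketch(models, horizon, budget, latency_target, eta=0.05, seed=0):
    """Illustrative only. models: list of (mean_reward, mean_cost, mean_latency),
    each in [0, 1]; feedback is simulated as Bernoulli draws (an assumption)."""
    rng = random.Random(seed)
    n = len(models)
    counts = [0] * n
    sums = [[0.0, 0.0, 0.0] for _ in range(n)]  # running reward, cost, latency totals
    lam_cost = 0.0  # dual variable for the budget (packing) constraint
    lam_lat = 0.0   # dual variable for the latency (covering-style) target
    spent = total_reward = total_latency = 0.0
    rounds = 0

    for t in range(1, horizon + 1):
        # Hard constraint: one call costs at most 1, so stop before overspending.
        if budget - spent < 1.0:
            break

        untried = [k for k in range(n) if counts[k] == 0]
        if untried:
            choice = untried[0]  # try every model once before using estimates
        else:
            def score(k):
                bonus = math.sqrt(2.0 * math.log(t) / counts[k])
                reward_ucb = min(1.0, sums[k][0] / counts[k] + bonus)  # optimistic reward
                cost_lcb = max(0.0, sums[k][1] / counts[k] - bonus)    # optimistic cost
                lat_lcb = max(0.0, sums[k][2] / counts[k] - bonus)     # optimistic latency
                return reward_ucb - lam_cost * cost_lcb - lam_lat * lat_lcb
            choice = max(range(n), key=score)

        # Partial feedback: only the chosen model's outcome is observed.
        reward, cost, latency = (1.0 if rng.random() < p else 0.0 for p in models[choice])
        counts[choice] += 1
        sums[choice][0] += reward
        sums[choice][1] += cost
        sums[choice][2] += latency
        spent += cost
        total_reward += reward
        total_latency += latency
        rounds += 1

        # Projected dual ascent: raise a multiplier when its constraint is being violated.
        lam_cost = max(0.0, lam_cost + eta * (cost - budget / horizon))
        lam_lat = max(0.0, lam_lat + eta * (latency - latency_target))

    return {
        "rounds": rounds,
        "spent": spent,
        "total_reward": total_reward,
        "avg_latency": total_latency / rounds if rounds else 0.0,
        "counts": counts,
    }


if __name__ == "__main__":
    demo_models = [(0.9, 0.8, 0.2), (0.5, 0.2, 0.6), (0.3, 0.1, 0.1)]
    print(run_sketch(demo_models, horizon=500, budget=150.0, latency_target=0.4))
```

In this sketch the budget is enforced by stopping early, so it can never be exceeded, while the latency target is only encouraged through its multiplier and may be violated along the way. This mirrors the hard versus soft distinction drawn above, but carries none of the paper's regret or violation guarantees.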