With the rapid advancement of large language models (LLMs), the diversity of multi-LLM tasks and the variability in their pricing structures have become increasingly important, as costs can vary greatly across different LLMs. To tackle these challenges, we introduce \textit{C2MAB-V}, a \underline{C}ost-effective \underline{C}ombinatorial \underline{M}ulti-armed \underline{B}andit with \underline{V}ersatile reward models for optimal LLM selection and usage. This online model differs from traditional static approaches, and from those that rely on a single LLM without considering cost. With multiple LLMs deployed on a scheduling cloud and a local server dedicated to handling user queries, \textit{C2MAB-V} selects multiple LLMs over a combinatorial search space, tailored to various collaborative task types with different reward models. Building on our designed online feedback mechanism and confidence bound technique, \textit{C2MAB-V} effectively addresses the multi-LLM selection challenge by managing the exploration-exploitation trade-off across models while balancing cost and reward for diverse tasks. We solve the resulting NP-hard integer linear program for selecting multiple LLMs under these trade-offs by: i) having the local server decompose the integer problem into a relaxed continuous form, ii) having the scheduling cloud apply a discretization rounding scheme that provides optimal LLM combinations, and iii) continually updating the model online based on feedback. Theoretically, we prove that \textit{C2MAB-V} offers strict guarantees over versatile reward models, matching state-of-the-art regret and violation bounds in some degenerate cases. Empirically, we show that \textit{C2MAB-V} effectively balances performance and cost-efficiency with nine LLMs across three application scenarios.
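To make the relax-round-update pipeline concrete, the following is a minimal sketch, not the paper's actual algorithm: unknown per-LLM rewards are estimated with UCB-style confidence bounds, the relaxed continuous problem is a fractional knapsack (solved greedily, which is optimal for the LP relaxation), and the fractional solution is discretized by randomized rounding. All names (`fractional_knapsack`, `run_c2mab_v_sketch`), the budget constraint, and the Bernoulli reward model are illustrative assumptions.

```python
import math
import random

def fractional_knapsack(values, costs, budget):
    """Relaxed LP: maximize sum(x_k * values[k]) subject to
    sum(x_k * costs[k]) <= budget and 0 <= x_k <= 1.
    The greedy fill by value/cost ratio is optimal for this relaxation."""
    order = sorted(range(len(values)),
                   key=lambda k: values[k] / costs[k], reverse=True)
    x = [0.0] * len(values)
    remaining = budget
    for k in order:
        take = min(1.0, remaining / costs[k])
        if take <= 0:
            break
        x[k] = take
        remaining -= take * costs[k]
    return x

def run_c2mab_v_sketch(true_means, costs, budget, horizon, rng):
    """Illustrative loop: UCB estimates -> LP relaxation -> randomized
    rounding -> online update from observed (Bernoulli) rewards."""
    n = len(true_means)
    counts = [0] * n      # pulls per LLM
    means = [0.0] * n     # empirical mean reward per LLM
    for t in range(1, horizon + 1):
        # Optimistic reward estimates; unpulled arms get the maximal bound.
        ucb = [means[k] + math.sqrt(1.5 * math.log(t) / counts[k])
               if counts[k] else float("inf") for k in range(n)]
        ucb = [min(u, 1.0) for u in ucb]  # rewards live in [0, 1]
        # Step i) relaxed continuous problem.
        x = fractional_knapsack(ucb, costs, budget)
        # Step ii) discretization via randomized rounding: include LLM k
        # with probability x_k (keeps the budget satisfied in expectation).
        chosen = [k for k in range(n) if rng.random() < x[k]]
        # Step iii) online update from feedback.
        for k in chosen:
            r = 1.0 if rng.random() < true_means[k] else 0.0
            counts[k] += 1
            means[k] += (r - means[k]) / counts[k]
    return means, counts
```

This toy version handles only a single budget constraint and independent Bernoulli rewards; the paper's setting covers versatile reward models and collaborative task structures beyond this sketch.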