Which LLM to Play? Convergence-Aware Online Model Selection with Time-Increasing Bandits

Web-based applications such as chatbots, search engines and news recommendations continue to grow in scale and complexity with the recent surge in the adoption of LLMs. Online model selection has thus garnered increasing attention due to the need to choose the best model among a diverse set while balancing task reward and exploration cost. Organizations face decisions such as whether to employ a costly API-based LLM or a locally finetuned small LLM, weighing cost against performance. Traditional selection methods often evaluate every candidate model before choosing one, which is becoming impractical given the rising costs of training and finetuning LLMs. Moreover, it is undesirable to allocate excessive resources to exploring poorly performing models. While some recent works leverage online bandit algorithms to manage this exploration-exploitation trade-off in model selection, they tend to overlook the increasing-then-converging trend in model performance as a model is iteratively finetuned, leading to less accurate predictions and suboptimal model selections. In this paper, we propose a time-increasing bandit algorithm, TI-UCB, which effectively predicts the increase in model performance due to finetuning and efficiently balances exploration and exploitation in model selection. To further capture the converging points of models, we develop a change detection mechanism that compares consecutive increase predictions. We theoretically prove that our algorithm achieves a logarithmic regret upper bound in a typical increasing bandit setting, which implies a fast convergence rate. The advantage of our method is also empirically validated through extensive experiments on classification model selection and online selection of LLMs. Our results highlight the importance of utilizing the increasing-then-converging pattern for more efficient and economical model selection in the deployment of LLMs.
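To make the idea concrete, the following is a minimal illustrative sketch, not the TI-UCB algorithm itself, whose exact estimator, confidence bound and detection rule are not given in this abstract. It assumes a sliding-window linear fit to predict each model's next reward, a standard UCB-style exploration bonus, and a simple change test that compares the slopes of two consecutive windows. The names `TrendUCB`, `fit_line` and `simulate`, and the parameters `window`, `threshold` and `bonus`, are hypothetical, as is the toy environment.

```python
import math
import random


def fit_line(ys):
    """Least-squares slope and intercept of ys against x = 0..len(ys)-1."""
    n = len(ys)
    if n < 2:
        return 0.0, ys[0]
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


class TrendUCB:
    """Hypothetical trend-aware UCB policy (an assumption, not the paper's TI-UCB)."""

    def __init__(self, n_arms, window=10, threshold=0.01, bonus=0.5):
        self.window = window        # number of recent rewards used for the fit
        self.threshold = threshold  # slope drop that signals convergence
        self.bonus = bonus          # scale of the exploration bonus
        self.history = [[] for _ in range(n_arms)]

    def predict(self, arm):
        """Predict the reward of the arm's next pull from its recent trend."""
        ys = self.history[arm]
        recent = ys[-self.window:]
        slope, intercept = fit_line(recent)
        previous = ys[-2 * self.window:-self.window]
        if len(previous) == self.window:
            prev_slope, _ = fit_line(previous)
            # Change detection: a sharp slope drop between consecutive
            # windows suggests the arm has converged, so stop extrapolating.
            if prev_slope - slope > self.threshold:
                return sum(recent) / len(recent)
        # Otherwise extrapolate the fitted line one step ahead.
        return intercept + slope * len(recent)

    def select(self):
        # Pull every arm twice first so a slope can be estimated.
        for arm, ys in enumerate(self.history):
            if len(ys) < 2:
                return arm
        t = sum(len(ys) for ys in self.history)

        def index(arm):
            n = len(self.history[arm])
            return self.predict(arm) + self.bonus * math.sqrt(math.log(t) / n)

        return max(range(len(self.history)), key=index)

    def update(self, arm, reward):
        self.history[arm].append(reward)


def simulate(rounds=500, seed=0):
    """Toy environment: arm 0 improves with each pull, arm 1 is static."""
    rng = random.Random(seed)
    policy = TrendUCB(n_arms=2)
    pulls = [0, 0]
    for _ in range(rounds):
        arm = policy.select()
        if arm == 0:
            # Reward rises as the model is "finetuned", then plateaus at 0.9.
            mean = min(0.9, 0.2 + 0.02 * pulls[0])
        else:
            mean = 0.5
        policy.update(arm, mean + rng.gauss(0.0, 0.01))
        pulls[arm] += 1
    return pulls
```

In this toy setting, calling `simulate()` returns the number of pulls per arm. Arm 0 starts well below arm 1, so a policy that only averages past rewards could underrate it for a long time, whereas extrapolating the trend lets the policy commit to the improving arm earlier.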
