There is a rapidly growing number of open-source Large Language Models (LLMs) and of benchmark datasets for comparing them. While some models dominate these benchmarks, no single model typically achieves the best accuracy across all tasks and use cases. In this work, we address the challenge of selecting the best LLM from a collection of models for a new task. We propose a new formulation of the problem, in which benchmark datasets are repurposed to learn a "router" model for this LLM selection, and we show that the problem can be reduced to a collection of binary classification tasks. We demonstrate the utility and limitations of learning model routers from various benchmark datasets, where routing consistently improves on the performance of using any single model for all tasks.
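One way to read the reduction to binary classification is sketched below: for each candidate LLM, a classifier predicts whether that model would answer a given input correctly, and the router picks the model with the highest average predicted correctness on the new task. This is only an illustrative sketch under stated assumptions, not the method from this work. The toy 2-D embeddings, the model names `model_a` and `model_b`, the k-nearest-neighbor classifier, and the mean-score aggregation are all hypothetical choices made for the example.

```python
from math import dist

# Hypothetical toy benchmark (assumption, not real data): each sample has a
# 2-D embedding and, per candidate LLM, a 0/1 label for "answered correctly".
BENCHMARK = [
    ((0.0, 0.1), {"model_a": 1, "model_b": 0}),
    ((0.1, 0.0), {"model_a": 1, "model_b": 0}),
    ((0.2, 0.2), {"model_a": 1, "model_b": 1}),
    ((1.0, 0.9), {"model_a": 0, "model_b": 1}),
    ((0.9, 1.0), {"model_a": 0, "model_b": 1}),
    ((0.8, 0.8), {"model_a": 1, "model_b": 1}),
]


def predict_correct_prob(model, x, k=3):
    """Binary classifier for one model: k-NN estimate of P(model is correct on x)."""
    neighbors = sorted(BENCHMARK, key=lambda sample: dist(sample[0], x))[:k]
    return sum(labels[model] for _, labels in neighbors) / k


def route(task_inputs, models=("model_a", "model_b"), k=3):
    """Return the model with the highest mean predicted correctness on a new task."""
    scores = {
        m: sum(predict_correct_prob(m, x, k) for x in task_inputs) / len(task_inputs)
        for m in models
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Inputs of a new task that lie near the first cluster of benchmark samples.
    best, scores = route([(0.05, 0.05), (0.1, 0.1)])
    print(best, scores)
```

Each model gets its own binary correctness predictor, so adding a new LLM to the pool only requires its correctness labels on the benchmark samples; the other predictors are unchanged.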