Developing foundational large language models (LLMs) is becoming increasingly costly and inefficient. Moreover, closed-source and larger open-source models generally offer better response quality, but at higher inference costs than smaller models. In this paper, we introduce Routoo, an architecture designed to optimize the selection of LLMs for specific prompts based on performance, cost, and efficiency. Routoo consists of two key components: a performance predictor and a cost-aware decoder. The performance predictor is a lightweight LLM that estimates the performance of various underlying LLMs without needing to execute and evaluate them. The cost-aware decoder then selects the most suitable model based on these predictions and other constraints such as cost and latency. We evaluated Routoo on the MMLU benchmark across 57 domains using open-source models. Our results show that Routoo matches the performance of the Mixtral 8x7b model while reducing inference costs by one-third. Furthermore, when the cost budget is increased, Routoo surpasses Mixtral's accuracy by over 5% at equivalent cost, achieving an accuracy of 75.9%. When GPT4 is added to the model pool, Routoo nearly matches GPT4's performance at half the cost and exceeds it with a 25% cost reduction. These results highlight Routoo's potential to set a new SOTA in a cost-effective manner by leveraging the collective knowledge of multiple LLMs.
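The two-stage pipeline described above (predict per-model quality, then select under a cost constraint) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the model names, scores, and per-query costs are invented assumptions, and the real performance predictor is a learned lightweight LLM rather than a lookup table.

```python
def select_model(predicted_scores: dict, costs: dict, budget: float):
    """Cost-aware decoding sketch: among models whose per-query cost
    fits the budget, pick the one with the highest predicted score.

    predicted_scores: model name -> predicted quality (from the
                      performance predictor; hard-coded here)
    costs:            model name -> cost per query (assumed units)
    budget:           maximum acceptable cost for this query
    """
    affordable = {m: s for m, s in predicted_scores.items() if costs[m] <= budget}
    if not affordable:
        return None  # no model satisfies the constraint
    return max(affordable, key=affordable.get)

# Illustrative numbers only -- not measured values from the paper.
scores = {"mistral-7b": 0.62, "mixtral-8x7b": 0.71, "gpt-4": 0.86}
costs = {"mistral-7b": 0.2, "mixtral-8x7b": 0.6, "gpt-4": 3.0}

print(select_model(scores, costs, budget=1.0))  # -> mixtral-8x7b
print(select_model(scores, costs, budget=5.0))  # -> gpt-4
```

In practice the predicted scores would be produced per prompt, so easy queries can be routed to cheap models and hard ones to stronger models, which is what lets the router undercut a single large model's cost at comparable accuracy.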