Compound AI systems that combine multiple LLM calls, such as self-refine and multi-agent debate, achieve strong performance on many AI tasks. We address a core question in optimizing compound systems: for each LLM call or module in the system, how should one decide which LLM to use? We show that these LLM choices have a large effect on quality, but the search space is exponential. We propose LLMSelector, an efficient framework for model selection in compound systems, which leverages two key empirical insights: (i) end-to-end performance is often monotonic in how well each module performs, with all other modules held fixed, and (ii) per-module performance can be estimated accurately by an LLM. Building upon these insights, LLMSelector iteratively selects one module and allocates to it the model with the highest module-wise performance, as estimated by an LLM, until no further gain is possible. LLMSelector is applicable to any compound system with a bounded number of modules, and its number of API calls scales linearly with the number of modules, achieving high-quality model allocation both empirically and theoretically. Experiments with popular compound systems such as multi-agent debate and self-refine using LLMs such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 show that LLMSelector confers 5%-70% accuracy gains compared to using the same LLM for all modules.
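The allocation loop described above can be sketched as a coordinate-ascent procedure: one module at a time, reassign the model with the highest estimated module-wise performance, and stop when no reassignment improves the estimate. The sketch below is illustrative only, under assumed names; `estimate` stands in for the LLM-based per-module performance estimator, and the model and module names are hypothetical, not the paper's API.

```python
# Hypothetical sketch of the LLMSelector-style allocation loop.
# `estimate(allocation, module, model)` is a placeholder for the
# LLM-based module-wise performance estimate described in the abstract.

MODELS = ["gpt-4o", "claude-3.5-sonnet", "gemini-1.5"]
MODULES = ["generator", "critic", "refiner"]  # illustrative module names

def llmselector(modules, models, estimate):
    # Start with the same model for every module.
    allocation = {m: models[0] for m in modules}
    improved = True
    while improved:                       # stop when no module can be improved
        improved = False
        for module in modules:            # one module at a time, others fixed
            scores = {model: estimate(allocation, module, model)
                      for model in models}
            best = max(scores, key=scores.get)
            if scores[best] > scores[allocation[module]]:
                allocation[module] = best
                improved = True
    return allocation
```

Each pass queries the estimator once per (module, model) pair, so the number of estimator calls grows linearly with the number of modules for a fixed model pool, consistent with the scaling claim in the abstract.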