Researchers and practitioners operating on a limited budget face a cost-performance trade-off: whether to use a large LLM with better performance or a smaller one with lower cost. This has motivated recent research on the optimisation of LLM calls. Existing approaches use either a cascading strategy, in which the smaller LLM is called first and the larger one is called afterwards only if needed, or a routing strategy, in which only one model is ever called. Both strategies depend on a decision criterion, which is typically implemented by an extra neural model. In this work, we propose a simpler solution: we use only the uncertainty of the small LLM's generations as the decision criterion. We compare our approach against cascading and routing strategies that require an additional neural model, using three different pairs of pre-trained small and large LLMs on nine different tasks. Our experiments reveal that this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
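As an illustration of the decision criterion described in the abstract, the following is a minimal sketch of an uncertainty-based cascade. The abstract does not say which uncertainty measure is used, so the mean negative log-probability of the generated tokens is an assumption here. The model interfaces, the threshold value, and the names `mean_token_nll`, `cascade`, `toy_small`, and `toy_large` are likewise hypothetical stand-ins rather than the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# the small model returns its answer plus per-token log-probabilities,
# the large model returns only its answer.
SmallModel = Callable[[str], Tuple[str, List[float]]]
LargeModel = Callable[[str], str]


def mean_token_nll(token_logprobs: List[float]) -> float:
    """Mean negative log-probability of the generated tokens.

    Higher values mean the small model was less certain. This specific
    measure is an assumption; other uncertainty estimates could be used.
    """
    if not token_logprobs:
        return math.inf  # no tokens generated: treat as maximally uncertain
    return -sum(token_logprobs) / len(token_logprobs)


@dataclass
class CascadeResult:
    answer: str
    used_large: bool
    uncertainty: float


def cascade(prompt: str, small: SmallModel, large: LargeModel,
            threshold: float) -> CascadeResult:
    """Call the small model first; escalate only if it looks uncertain."""
    answer, logprobs = small(prompt)
    uncertainty = mean_token_nll(logprobs)
    if uncertainty > threshold:
        # The small model's generation is too uncertain: pay for the large model.
        return CascadeResult(large(prompt), True, uncertainty)
    return CascadeResult(answer, False, uncertainty)


# Toy stand-ins for real models, for demonstration only.
def toy_small(prompt: str) -> Tuple[str, List[float]]:
    if "easy" in prompt:
        return "small answer", [-0.05, -0.10]  # confident tokens
    return "small guess", [-1.5, -2.5]         # uncertain tokens


def toy_large(prompt: str) -> str:
    return "large answer"


easy = cascade("easy question", toy_small, toy_large, threshold=0.5)
hard = cascade("hard question", toy_small, toy_large, threshold=0.5)
```

In this sketch the threshold is the only tunable quantity: raising it keeps more queries on the small model and lowers cost, while lowering it sends more queries to the large model. No additional neural model is trained or called to make the decision.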