Large language models (LLMs) excel at most NLP tasks, but their size requires expensive cloud servers for deployment, while smaller models that can be deployed on lower-cost (e.g., edge) devices tend to lag behind in response quality. In this work we therefore propose a hybrid inference approach that combines their respective strengths to save cost while maintaining quality. Our approach uses a router that assigns each query to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost as scenario requirements dictate. In experiments, our approach allows us to make up to 40% fewer calls to the large model with no drop in response quality.
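The routing idea above can be sketched minimally as follows. This is an illustrative assumption, not the paper's implementation: `toy_difficulty` stands in for a learned difficulty predictor, and the threshold plays the role of the tunable quality level — raising it sends more queries to the small model, trading quality for cost at test time.

```python
def route(query: str, difficulty_fn, threshold: float) -> str:
    """Decide which model serves a query.

    difficulty_fn: hypothetical scorer mapping a query to [0, 1].
    threshold: queries scoring at or below it go to the small model;
    tuning it at test time trades response quality for cost.
    """
    return "small" if difficulty_fn(query) <= threshold else "large"


def toy_difficulty(query: str) -> float:
    # Stand-in for a learned difficulty predictor: treat longer
    # queries as harder, capped at 1.0.
    return min(len(query.split()) / 20.0, 1.0)


# An easy (short) query is routed to the cheap small model.
print(route("What is 2+2?", toy_difficulty, threshold=0.5))  # small
```

A real router would replace `toy_difficulty` with a trained classifier; the decision rule itself stays this simple.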