Vision-Language Models (VLMs) excel at diverse multimodal tasks. However, user requirements vary across scenarios and can be broadly categorized as fast response, high-quality output, and low energy consumption. Relying solely on large cloud-deployed models for all queries often incurs high latency and energy cost, whereas small models deployed on edge devices can handle simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80\% of queries to the small model while incurring less than a 10\% drop in problem-solving probability.
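The scenario-aware routing decision can be illustrated with a minimal sketch: score each candidate model by a user-weighted trade-off between estimated solve probability and latency/energy cost, then dispatch to the higher-scoring one. All function names, weights, and cost values here are hypothetical illustrations, not the actual ECVL-ROUTER strategy or its trained router.

```python
# Hypothetical sketch of scenario-aware routing between a small edge model
# and a large cloud model. Weights, costs, and quality estimates are
# illustrative assumptions, not the paper's actual method.

def route(p_small: float, p_large: float,
          latency: tuple[float, float],
          energy: tuple[float, float],
          prefs: dict[str, float]) -> str:
    """Pick the model with the higher scenario-weighted utility.

    p_small / p_large : estimated probability each model solves the query
    latency, energy   : (small, large) normalized costs in [0, 1]
    prefs             : weights for "quality", "speed", "energy"
    """
    def utility(p: float, lat: float, en: float) -> float:
        # Reward expected solve probability; penalize latency and energy.
        return (prefs["quality"] * p
                - prefs["speed"] * lat
                - prefs["energy"] * en)

    u_small = utility(p_small, latency[0], energy[0])
    u_large = utility(p_large, latency[1], energy[1])
    return "small" if u_small >= u_large else "large"

# A speed-sensitive scenario favors the edge model despite a quality gap.
choice = route(p_small=0.75, p_large=0.92,
               latency=(0.1, 0.8), energy=(0.1, 0.9),
               prefs={"quality": 0.3, "speed": 0.5, "energy": 0.2})
```

Under these illustrative numbers the speed-weighted scenario routes to the small model, while shifting nearly all weight onto quality would flip the decision to the large model, mirroring how user requirements steer the routing.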