With advances in foundation and vision-language models, and in effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model can handle all the tasks and applications that potential users may envision. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance or computational needs), produce test-time, sample-specific solutions that are difficult to deploy, and sometimes require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components and acts like a solution search engine across the available models. Given a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks their performance and resource characteristics, allowing users to pick a solution that meets their unique design constraints. From a technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. The project page is available at https://davidhalladay.github.io/mmfactory_demo.
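The final step described in the abstract, where a user picks one solution from a benchmarked pool subject to resource and performance constraints, could look roughly like the sketch below. This is not MMFactory's actual interface: the `Candidate` class, the `select_solution` function, their fields, and the tie-breaking rule are hypothetical names and choices made for illustration, and the numbers in `example_pool` are placeholders rather than measured results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """One proposed solution with its benchmarked characteristics (hypothetical schema)."""
    name: str
    accuracy: float    # score on the proposed metric, higher is better
    latency_s: float   # average seconds per sample
    memory_gb: float   # peak memory use


def select_solution(
    candidates: List[Candidate],
    max_latency_s: Optional[float] = None,
    max_memory_gb: Optional[float] = None,
    min_accuracy: Optional[float] = None,
) -> Optional[Candidate]:
    """Return the most accurate candidate that satisfies every given constraint.

    A constraint left as None is ignored. Ties on accuracy are broken in
    favor of lower latency. Returns None when no candidate is feasible.
    """
    feasible = [
        c for c in candidates
        if (max_latency_s is None or c.latency_s <= max_latency_s)
        and (max_memory_gb is None or c.memory_gb <= max_memory_gb)
        and (min_accuracy is None or c.accuracy >= min_accuracy)
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: (c.accuracy, -c.latency_s))


# Placeholder pool: illustrative values only, not results from the paper.
example_pool = [
    Candidate("solution_a", accuracy=0.90, latency_s=4.0, memory_gb=24.0),
    Candidate("solution_b", accuracy=0.80, latency_s=1.0, memory_gb=8.0),
    Candidate("solution_c", accuracy=0.70, latency_s=0.5, memory_gb=4.0),
]
```

With no constraints, `select_solution(example_pool)` returns the most accurate entry. Adding `max_latency_s=2.0` excludes `solution_a` and returns `solution_b` instead, which mirrors the trade-off between performance and computational cost that the abstract says existing approaches overlook.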