MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

With advances in foundational and vision-language models, and in effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model can handle all the tasks and applications that potential users may envision. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance or computational needs), produce test-time, sample-specific solutions that are difficult to deploy, and sometimes require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components and acts like a solution search engine across various available models. Given a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance and resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. The project page is available at https://davidhalladay.github.io/mmfactory_demo.
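The final step described in the abstract, picking one solution from a benchmarked pool under user constraints, can be illustrated with a minimal sketch. This is not MMFactory's API: the `Candidate` record, the `pick_solution` function, the solution names, and every number below are hypothetical placeholders chosen only to show the selection logic, not reported results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One benchmarked solution in the pool (hypothetical record, not MMFactory's API)."""
    name: str
    accuracy: float    # score on the user's sample pairs, higher is better
    latency_s: float   # average seconds per sample
    memory_gb: float   # peak memory observed while benchmarking


def pick_solution(pool, max_latency_s=None, max_memory_gb=None):
    """Return the most accurate candidate that meets the optional constraints, or None."""
    feasible = [
        c for c in pool
        if (max_latency_s is None or c.latency_s <= max_latency_s)
        and (max_memory_gb is None or c.memory_gb <= max_memory_gb)
    ]
    if not feasible:
        return None
    # Prefer higher accuracy; break ties in favor of lower latency.
    return max(feasible, key=lambda c: (c.accuracy, -c.latency_s))


# Illustrative pool; names and numbers are made up for this sketch.
pool = [
    Candidate("large_vlm_only", accuracy=0.81, latency_s=4.0, memory_gb=40.0),
    Candidate("detector_plus_small_vlm", accuracy=0.78, latency_s=1.2, memory_gb=12.0),
    Candidate("captioner_plus_llm", accuracy=0.70, latency_s=0.6, memory_gb=8.0),
]

unconstrained = pick_solution(pool)
budgeted = pick_solution(pool, max_memory_gb=16.0)
```

With no constraints the sketch returns the most accurate candidate; with a 16 GB memory budget it falls back to the best candidate that fits, which mirrors the trade-off the abstract says users can make between performance and resource characteristics.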

