Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest-growing category of multimodal models because of their many practical use cases, including healthcare, robotics, and accessibility. Unfortunately, although the VLMs in the literature demonstrate impressive visual capabilities on various benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using $8.9\times$--$11.6\times$ fewer GPU hours than grid search. We also find that Mordal achieves about 69\% higher weighted Kendall's $\tau$ on average than the state-of-the-art model selection method across diverse tasks.
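To make the ranking metric concrete, the following is a minimal sketch of one common form of weighted Kendall's $\tau$, in which disagreements among top-ranked candidates are penalized more than disagreements near the bottom. The additive hyperbolic weights, the function name `weighted_kendall_tau`, and the toy scores are assumptions chosen for illustration; the abstract does not specify the exact weighting, so this should not be read as the formulation used in the evaluation.

```python
from itertools import combinations
from math import sqrt


def weighted_kendall_tau(true_scores, pred_scores):
    """Rank agreement in [-1, 1], weighting top-ranked candidates more heavily.

    Assumption: each pair (i, j) gets the additive hyperbolic weight
    1 / (r_i + 1) + 1 / (r_j + 1), where r is the 0-based rank of a candidate
    under true_scores (rank 0 = best). This is an illustrative choice.
    """
    n = len(true_scores)
    if n != len(pred_scores) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")

    # Rank candidates by ground-truth score, best first.
    order = sorted(range(n), key=lambda i: -true_scores[i])
    rank = {idx: r for r, idx in enumerate(order)}

    def sign(v):
        return (v > 0) - (v < 0)

    num = den_true = den_pred = 0.0
    for i, j in combinations(range(n), 2):
        w = 1.0 / (rank[i] + 1) + 1.0 / (rank[j] + 1)
        s_true = sign(true_scores[i] - true_scores[j])
        s_pred = sign(pred_scores[i] - pred_scores[j])
        num += w * s_true * s_pred        # +w if the pair agrees, -w if not
        den_true += w * s_true * s_true   # tied pairs contribute zero
        den_pred += w * s_pred * s_pred

    if den_true == 0 or den_pred == 0:
        return float("nan")  # undefined when one list is entirely tied
    return num / sqrt(den_true * den_pred)


# Hypothetical task accuracies of four candidate VLMs (ground truth).
true_acc = [4.0, 3.0, 2.0, 1.0]
# Two hypothetical predicted rankings, each with one adjacent swap.
swap_at_top = [3.0, 4.0, 2.0, 1.0]
swap_at_bottom = [4.0, 3.0, 1.0, 2.0]

tau_top = weighted_kendall_tau(true_acc, swap_at_top)
tau_bottom = weighted_kendall_tau(true_acc, swap_at_bottom)
```

Under these weights, misordering the two best candidates lowers the score more than misordering the two worst, which is the property that makes a weighted variant suitable for model selection, where only the top of the ranking matters in practice.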