Enhancing Subtask Performance of Multi-modal Large Language Model

A Multi-modal Large Language Model (MLLM) is a model extended from a Large Language Model (LLM) that can process and reason over multi-modal data. Current MLLMs typically use an LLM to decompose a task into multiple subtasks, employ individual pre-trained models to complete specific subtasks, and finally use the LLM to integrate the subtask results into the result of the overall task. In real-world scenarios, a large project is commonly broken down into smaller sub-projects, with different teams providing corresponding solutions or results. The project owner then decides which solution or result to use, ensuring the best possible outcome for each sub-project and, consequently, for the entire project. Inspired by this practice, this study considers selecting multiple pre-trained models to complete the same subtask. By combining the results from multiple pre-trained models, the optimal subtask result is obtained, which enhances the performance of the MLLM. Specifically, this study first selects multiple pre-trained models focused on the same subtask based on distinct evaluation approaches, and then invokes these models in parallel to process the input data and generate the corresponding subtask results. Finally, the LLM compares the results that the pre-trained models produce for the same subtask and chooses the best one as the outcome of that subtask. Extensive experiments are conducted on GPT-4 annotated datasets and human-annotated datasets. The results on various evaluation metrics demonstrate the effectiveness of the proposed approach.
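
The three steps described in the abstract (select several models for one subtask, invoke them in parallel, let the LLM choose the best result) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the two captioning functions, `llm_pick_best`, and `run_subtask` are hypothetical names introduced here, the "models" are stubs that return fixed strings, and the judge uses a simple length heuristic as a placeholder where a real system would prompt an LLM to compare the candidates.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical stand-ins for two pre-trained models that solve the same
# subtask (image captioning). Real models would run inference here.
def captioner_a(image_path: str) -> str:
    return f"a photo ({image_path})"

def captioner_b(image_path: str) -> str:
    return f"a detailed photo of a dog on grass ({image_path})"

def llm_pick_best(candidates: Dict[str, str]) -> str:
    """Placeholder for the LLM comparison step.

    Returns the name of the chosen model. Assumption: the longest result
    wins; an actual system would ask the LLM to compare the candidates.
    """
    return max(candidates, key=lambda name: len(candidates[name]))

def run_subtask(
    models: Dict[str, Callable[[str], str]],
    data: str,
    judge: Callable[[Dict[str, str]], str],
) -> str:
    """Run every model on the same input in parallel and return the result
    that the judge selects."""
    if not models:
        raise ValueError("at least one model is required")
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # Submit all models first so they execute concurrently.
        futures = {name: pool.submit(fn, data) for name, fn in models.items()}
        results = {name: fut.result() for name, fut in futures.items()}
    return results[judge(results)]

if __name__ == "__main__":
    selected = {"captioner_a": captioner_a, "captioner_b": captioner_b}
    print(run_subtask(selected, "dog.jpg", llm_pick_best))
```

Passing the judge as a parameter keeps the comparison step replaceable, so the placeholder heuristic can be swapped for an LLM call without changing the parallel invocation logic.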

