A multi-modal large language model (MLLM) is a large language model (LLM) extended to process and reason over multi-modal data. Current MLLMs typically use an LLM to decompose a task into multiple subtasks, employ an individual pre-trained model to complete each subtask, and finally use the LLM to integrate the subtask results into the result of the task. In real-world practice, a large project is commonly broken down into smaller sub-projects, with different teams providing candidate solutions or results for each. The project owner then decides which solution or result to adopt, ensuring the best possible outcome for each sub-project and, consequently, for the entire project. Inspired by this, this study assigns multiple pre-trained models to the same subtask and draws on their combined results to obtain the optimal subtask result, thereby enhancing the performance of the MLLM. Specifically, the study first selects multiple pre-trained models that target the same subtask, based on distinct evaluation approaches, and then invokes these models in parallel to process the input data and generate their respective subtask results. Finally, the LLM compares the results produced by the different models for the same subtask and chooses the best one as the outcome for that subtask. Extensive experiments are conducted on GPT-4-annotated and human-annotated datasets, and the results across multiple evaluation metrics demonstrate the effectiveness of the proposed approach.
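The three steps described in the abstract (several models for one subtask, parallel invocation, selection of the best result) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `run_candidates`, `solve_subtask`, `captioner_a`, `captioner_b`, and `longest_answer_judge` are hypothetical, the two "models" are plain stand-in functions, and the judge is a toy heuristic standing in for the LLM comparison step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

# Hypothetical types: a "model" maps an input to a text result, and a
# "judge" (an LLM in the paper's setting) returns the name of the best result.
Model = Callable[[str], str]
Judge = Callable[[str, Dict[str, str]], str]


def run_candidates(models: Dict[str, Model], data: str) -> Dict[str, str]:
    """Invoke every candidate model on the same input in parallel."""
    if not models:
        raise ValueError("at least one candidate model is required")
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in models.items()}
        return {name: future.result() for name, future in futures.items()}


def solve_subtask(
    models: Dict[str, Model], judge: Judge, subtask: str, data: str
) -> Tuple[str, str]:
    """Collect all candidate results, then let the judge pick one."""
    results = run_candidates(models, data)
    winner = judge(subtask, results)
    if winner not in results:
        raise ValueError(f"judge returned unknown model name: {winner!r}")
    return winner, results[winner]


# Stand-in models (assumptions for illustration only).
def captioner_a(image_path: str) -> str:
    return "a dog"


def captioner_b(image_path: str) -> str:
    return "a brown dog catching a frisbee in a park"


# Toy judge: prefers the longest answer. A real system would instead
# prompt an LLM with the subtask description and the candidate results.
def longest_answer_judge(subtask: str, results: Dict[str, str]) -> str:
    return max(results, key=lambda name: len(results[name]))


if __name__ == "__main__":
    candidates = {"captioner_a": captioner_a, "captioner_b": captioner_b}
    name, caption = solve_subtask(
        candidates, longest_answer_judge, "image captioning", "dog.jpg"
    )
    print(name, "->", caption)
```

Keeping the judge as a separate callable means the selection rule can be swapped (for example, for a call to an actual LLM) without touching the parallel invocation logic.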