InfMLLM: A Unified Framework for Visual-Language Tasks

Large language models (LLMs) have proven their remarkable versatility in handling a comprehensive range of language-centric applications. To expand LLMs' capabilities to a broader spectrum of modal inputs, multimodal large language models (MLLMs) have attracted growing interest. This work delves into enabling LLMs to tackle more vision-language-related tasks, particularly image captioning, visual question answering (VQA,) and visual grounding. To this end, we implemented a three-stage training scheme: starting with lightweight alignment pretraining, then moderate-weight multitask hybrid training, and finally, LLM fine-tuning to improve instruction following capability. Throughout the training process, the requirements on GPU memory gradually increase. To effectively manage the number of visual embeddings passed to the LLM while preserving their positional information, we introduce a straightforward visual adapter module dubbed pool-adapter. Our experiments demonstrate that preserving the positional information of visual embeddings through the pool-adapter is particularly beneficial for tasks like visual grounding. We name our proposed approach InfMLLM and have evaluated it extensively on various benchmark datasets. Our results demonstrate that InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs. The code and model will be made open-source at: \url{https://github.com/mightyzau/InfMLLM}.

翻译：大型语言模型（LLM）在处理多种语言为中心的应用中已展现出显著的通用性。为将LLM的能力扩展至更广泛的模态输入，多模态大语言模型（MLLM）引起了越来越多的关注。本文深入探讨了如何使LLM应对更多视觉-语言相关任务，特别是图像描述、视觉问答（VQA）和视觉定位。为此，我们实现了一个三阶段训练方案：从轻量级对齐预训练开始，接着进行中等权重的多任务混合训练，最后通过LLM微调提升指令遵循能力。在整个训练过程中，GPU内存需求逐步增加。为有效管理传递给LLM的视觉嵌入数量并保留其位置信息，我们引入了一个名为池化适配器的简单视觉适配模块。实验表明，通过池化适配器保留视觉嵌入的位置信息对视觉定位等任务尤为有利。我们将所提出的方法命名为InfMLLM，并在多个基准数据集上进行了广泛评估。结果表明，InfMLLM达到了现有最优（SOTA）性能或与近期MLLM相当的性能。代码和模型将开源发布在：\url{https://github.com/mightyzau/InfMLLM}。

相关内容

大语言模型

关注 66

大语言模型是基于海量文本数据训练的深度学习模型。它不仅能够生成自然语言文本，还能够深入理解文本含义，处理各种自然语言任务，如文本摘要、问答、翻译等。2023年，大语言模型及其在人工智能领域的应用已成为全球科技研究的热点，其在规模上的增长尤为引人注目，参数量已从最初的十几亿跃升到如今的一万亿。参数量的提升使得模型能够更加精细地捕捉人类语言微妙之处，更加深入地理解人类语言的复杂性。在过去的一年里，大语言模型在吸纳新知识、分解复杂任务以及图文对齐等多方面都有显著提升。随着技术的不断成熟，它将不断拓展其应用范围，为人类提供更加智能化和个性化的服务，进一步改善人们的生活和生产方式。

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日