Recent advances such as LLaVA and MiniGPT-4 have successfully integrated visual information into large language models (LLMs), yielding inspiring results and giving rise to a new generation of multi-modal LLMs (MLLMs). Nevertheless, these methods struggle with hallucinations and mutual interference between tasks. To tackle these problems, we propose u-LLaVA, an efficient and accurate approach for adapting to downstream tasks that uses the LLM as a bridge to connect multiple expert models. First, we incorporate a modality alignment module and multi-task modules into the LLM. Then, we reorganize or rebuild multi-type public datasets to enable efficient modality alignment and instruction following. Finally, task-specific information is extracted from the trained LLM and passed to the different modules to solve downstream tasks. The overall framework is simple and effective, and it achieves state-of-the-art performance across multiple benchmarks. We also make our model, the generated data, and the code base publicly available.