Multimodal Large Language Models (MLLMs) are developing rapidly, with many noteworthy contributions appearing in recent months. The prevailing trend is data-driven: diverse instruction-following datasets are collected for training. However, these approaches share a persistent weakness, namely limited visual perception ability, because they rely on CLIP-like encoders to extract visual information from inputs. Although these encoders are pre-trained on billions of image-text pairs, they still suffer from information loss, since textual captions only partially capture the contents of images. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, we introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLM training and inference pipeline, aiming to provide a more comprehensive and accurate summarization of visual inputs. Extensive experiments demonstrate its effectiveness in advancing MLLMs, showing improved visual perception achieved through the integration of visual experts.