Multimodal Large Language Models (MLLMs) are developing rapidly, with many noteworthy contributions in recent months. The prevailing trend is data-driven: diverse instruction-following datasets are collected. These approaches nonetheless share a persistent weakness, limited visual perception, because they rely on CLIP-like encoders to extract visual information from inputs. Although these encoders are pre-trained on billions of image-text pairs, they still suffer from information loss, since textual captions only partially capture the contents of images. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, we introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLM training and inference pipeline, aiming to provide a more comprehensive and accurate summarization of visual inputs. Extensive experiments evaluate the effectiveness of this method in advancing MLLMs and show improved visual perception from the integration of visual experts.