Large Multimodal Models (LMMs) have shown promise in vision-language tasks but struggle with high-resolution input and detailed scene understanding. To address these challenges, we introduce Monkey, which enhances LMM capabilities in two ways. First, Monkey divides each input image into uniform patches, each matching the size (e.g., 448x448) used in the original training of the well-trained vision encoder. Equipped with an individual adapter for each patch, Monkey can handle resolutions up to 1344x896 pixels, enabling detailed capture of complex visual information. Second, it employs a multi-level description generation method that enriches the context for scene-object associations. This two-part strategy ensures more effective learning from generated data: the higher resolution allows visuals to be captured in more detail, which in turn enhances the effectiveness of comprehensive descriptions. Extensive ablation results validate the effectiveness of our designs. Additionally, experiments on 18 datasets demonstrate that Monkey surpasses existing LMMs in many tasks, such as Image Captioning and various Visual Question Answering formats. In particular, in qualitative tests focused on dense text question answering, Monkey has exhibited encouraging results compared with GPT4V. Code is available at https://github.com/Yuliang-Liu/Monkey.
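
To make the patch arithmetic concrete, the following is a minimal sketch of dividing an image into a grid of non-overlapping 448x448 patches. It is not taken from the Monkey codebase: the function name `patch_boxes`, the `Box` alias, and the requirement that both sides be exact multiples of the patch size are assumptions made for illustration, and the per-patch adapters and vision encoder are omitted entirely.

```python
from typing import List, Tuple

# Hypothetical alias for this sketch: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def patch_boxes(width: int, height: int, patch_size: int = 448) -> List[Box]:
    """Return crop boxes for a grid of non-overlapping square patches.

    Boxes are listed row by row, starting at the top-left corner.
    Assumption of this sketch: both sides are exact multiples of patch_size.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if width % patch_size or height % patch_size:
        raise ValueError("width and height must be multiples of patch_size")
    boxes: List[Box] = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            boxes.append((left, top, left + patch_size, top + patch_size))
    return boxes


# The largest resolution mentioned in the abstract, 1344x896, yields a 3x2 grid.
boxes = patch_boxes(1344, 896)
print(len(boxes))           # 6
print(boxes[0], boxes[-1])  # (0, 0, 448, 448) (896, 448, 1344, 896)
```

Under these assumptions, a 1344x896 image produces six patches, each of which could be passed to the vision encoder at its native input size. The boxes use a (left, top, right, bottom) ordering so they can be handed to a typical image-cropping routine.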