In the pursuit of Artificial General Intelligence (AGI), a critical task for models is interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. Our study focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason about and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We evaluate a range of open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach based on multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about the multi-image inputs based on the identified similarities and differences. Our experimental results demonstrate that CoCoT improves the multi-image comprehension capabilities of large multimodal models.
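The two-stage structure of CoCoT prompting (compare the images first, then answer from that comparison) can be illustrated with a minimal sketch. The function name `build_cocot_prompt`, the `<image>` placeholder token, and the exact instruction wording are assumptions made for illustration; the abstract does not specify the prompt template, and real LMM interfaces differ in how images are attached.

```python
def build_cocot_prompt(question: str, num_images: int,
                       image_token: str = "<image>") -> str:
    """Assemble a contrastive chain-of-thought style prompt (illustrative only).

    The wording below is a hypothetical template, not the authors' prompt.
    """
    if num_images < 2:
        raise ValueError("CoCoT compares images, so at least two are required")

    # One labelled placeholder per image so each can be referred to separately.
    image_slots = "\n".join(
        f"Image {i}: {image_token}" for i in range(1, num_images + 1)
    )

    # Stage 1 asks for the comparison; stage 2 asks for an answer grounded in it.
    steps = (
        "Step 1: Describe the similarities and differences among the images above.\n"
        "Step 2: Using those similarities and differences, answer the question.\n"
    )
    return f"{image_slots}\n{steps}Question: {question}"


if __name__ == "__main__":
    print(build_cocot_prompt("Which image contains more people?", num_images=2))
```

The comparison instruction is placed before the question so that the model's answer is conditioned on the similarities and differences it has just identified, which is the ordering the abstract describes.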