Multi-modal large language models (MLLMs) have shown remarkable abilities in various visual understanding tasks. However, MLLMs still struggle with fine-grained visual recognition (FGVR), which aims to identify subordinate-level categories from images. This can negatively impact more advanced capabilities of MLLMs, such as object-centric visual question answering and reasoning. In our study, we revisit three quintessential capabilities of MLLMs for FGVR, namely object information extraction, category knowledge reserve, and object-category alignment, and position the root cause as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances its FGVR capability by incorporating informative attribute descriptions of objects into the training phase. We employ contrastive learning on object-attribute pairs and attribute-category pairs simultaneously, using examples from similar but incorrect categories as hard negatives to naturally bring representations of visual objects and category names closer. Extensive evaluations across multiple popular FGVR datasets demonstrate that Finedefics outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code is available at https://github.com/PKU-ICST-MIPL/Finedefics_ICLR2025.
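To illustrate the kind of objective described above, the following is a minimal PyTorch sketch of an InfoNCE-style contrastive loss with explicit hard negatives. The function name, tensor shapes, and temperature value are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style contrastive loss with explicit hard negatives (sketch).

    anchor:    (B, D)    e.g. visual object embeddings
    positive:  (B, D)    e.g. attribute-description embeddings
    negatives: (B, K, D) embeddings from similar-but-incorrect categories
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (anchor * positive).sum(-1, keepdim=True)       # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives)  # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)      # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy usage: the same loss applied to object-attribute and attribute-category pairs.
B, K, D = 8, 4, 256
obj, attr, cat = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
hard_negs = torch.randn(B, K, D)  # stand-in for hard-negative category embeddings
loss = info_nce(obj, attr, hard_negs) + info_nce(attr, cat, hard_negs)
```

In this sketch, attribute descriptions act as a bridge: pulling objects toward attributes and attributes toward category names, while pushing away hard negatives, mirrors the object-category alignment goal stated in the abstract.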