Multi-modal large language models (MLLMs) have shown remarkable abilities in various visual understanding tasks. However, MLLMs still struggle with fine-grained visual recognition (FGVR), which aims to identify subordinate-level categories from images. This can negatively impact more advanced capabilities of MLLMs, such as object-centric visual question answering and reasoning. In our study, we revisit three quintessential capabilities of MLLMs for FGVR, namely object information extraction, category knowledge reserve, and object-category alignment, and position the root cause as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances its FGVR capability by incorporating informative attribute descriptions of objects into the training phase. We employ contrastive learning on object-attribute pairs and attribute-category pairs simultaneously, using examples from similar but incorrect categories as hard negatives to naturally bring representations of visual objects and category names closer. Extensive evaluations across multiple popular FGVR datasets demonstrate that Finedefics outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code is available at https://github.com/PKU-ICST-MIPL/Finedefics_ICLR2025.
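To illustrate the kind of objective described above, the following is a minimal PyTorch sketch of an InfoNCE-style contrastive loss with explicit hard negatives. The function name, tensor shapes, and temperature value are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style contrastive loss with explicit hard negatives (sketch).

    anchor:    (B, D)    e.g. visual object embeddings
    positive:  (B, D)    e.g. attribute-description embeddings
    negatives: (B, K, D) embeddings from similar-but-incorrect categories
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (anchor * positive).sum(-1, keepdim=True)       # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives)  # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)      # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy usage: the same loss applied to object-attribute and attribute-category pairs.
B, K, D = 8, 4, 256
obj, attr, cat = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
hard_negs = torch.randn(B, K, D)  # stand-in for hard-negative category embeddings
loss = info_nce(obj, attr, hard_negs) + info_nce(attr, cat, hard_negs)
```

In this sketch, attribute descriptions act as a bridge: pulling objects toward attributes and attributes toward category names, while pushing away hard negatives, mirrors the object-category alignment goal stated in the abstract.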