Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of a target concept with respect to the known concepts it is misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Both the extracted features and the augmented images are automatically filtered to guarantee their quality, as verified by human annotators. We demonstrate the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets, including iNaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset of newly discovered animal species that are guaranteed to be unseen by LMMs. One-shot updating results with LLaVA-1.6 on these three datasets show that CoDA significantly outperforms SOTA visual data augmentation strategies, with absolute accuracy gains of 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNaturalist).
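As a rough illustration of the pipeline described above (not the authors' released implementation), the sketch below composes its stages: contrastive feature extraction is assumed to have already produced a list of discriminative attributes, the text-to-image generator and the image-text alignment scorer are passed in as placeholder callables (`generate`, `score`), and generated candidates falling below an alignment threshold are filtered out. All names, the prompt template, and the threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ContrastiveFeatures:
    """Contrastive attributes separating a target concept from the concept
    an LMM confuses it with (assumed output of the extraction stage)."""
    target: str            # the novel/rare concept to teach the model
    confused_with: str     # the known concept it is misrecognized as
    attributes: List[str]  # discriminative visual features, in text form


def augment_concept(
    feats: ContrastiveFeatures,
    generate: Callable[[str], Any],       # placeholder: text-to-image generator
    score: Callable[[Any, str], float],   # placeholder: image-text alignment scorer
    n_candidates: int = 8,
    threshold: float = 0.3,
) -> List[Any]:
    """Generate candidate images emphasizing each contrastive attribute,
    then keep only candidates that align well with the target concept."""
    kept: List[Any] = []
    for attr in feats.attributes:
        # Hypothetical prompt template highlighting the contrastive feature.
        prompt = (
            f"A photo of {feats.target}, clearly showing {attr}, "
            f"in contrast to {feats.confused_with}"
        )
        for _ in range(n_candidates):
            img = generate(prompt)
            # Automatic filtering: discard low-alignment synthetic images.
            if score(img, feats.target) >= threshold:
                kept.append(img)
    return kept


# Toy usage with dummy stand-ins for the generator and scorer:
feats = ContrastiveFeatures(
    target="new species A",               # illustrative placeholder name
    confused_with="similar known species B",
    attributes=["dark interocular bar", "granular dorsal skin"],
)
images = augment_concept(feats, generate=lambda p: p, score=lambda i, t: 0.5)
```

In practice, the `generate` and `score` hooks would wrap a multimodal generative model and an image-text alignment scorer respectively; this sketch only fixes the control flow of generate-then-filter augmentation.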