Pre-trained vision-language models have notably accelerated progress in open-world concept recognition. Their impressive zero-shot ability has recently been transferred to multi-label image classification via prompt tuning, enabling the discovery of novel labels in an open-vocabulary manner. However, this paradigm incurs non-trivial training costs and becomes computationally prohibitive when the number of candidate labels is large. To address this issue, we note that vision-language pre-training aligns images and texts in a unified embedding space, which makes it possible for an adapter network to identify labels in the visual modality while being trained in the text modality. To enhance this cross-modal transfer ability, we propose a simple yet effective method termed random perturbation, which enables the adapter to search for potential visual embeddings by perturbing text embeddings with noise during training, resulting in better performance in the visual modality. Furthermore, we introduce an effective approach that employs large language models for multi-label instruction-following text generation. In this way, we develop a fully automated pipeline for visual label recognition that does not rely on any manually annotated data. Extensive experiments on public benchmarks show the superiority of our method in various multi-label classification tasks.
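The random perturbation idea described in the abstract can be illustrated with a minimal sketch. The abstract only states that text embeddings are perturbed with noise during training. The Gaussian noise, the `noise_std` value, the L2 renormalization, and the function names below are assumptions made for illustration, not details confirmed by the source.

```python
import math
import random


def l2_normalize(vec):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def perturb_text_embedding(text_emb, noise_std=0.04, rng=None):
    """Return a noisy copy of a text embedding (hypothetical helper).

    Assumptions (not from the source): the noise is isotropic Gaussian,
    and embeddings are kept on the unit sphere before and after the
    perturbation, as is common for CLIP-style embedding spaces.
    """
    rng = rng or random.Random()
    unit = l2_normalize(text_emb)
    noisy = [x + rng.gauss(0.0, noise_std) for x in unit]
    return l2_normalize(noisy)


# Stand-in for a text-encoder output; a real pipeline would use the
# embedding produced by the pre-trained vision-language model.
rng = random.Random(0)
text_emb = [rng.gauss(0.0, 1.0) for _ in range(512)]
noisy_emb = perturb_text_embedding(text_emb, noise_std=0.04, rng=rng)
```

During training, an adapter would receive `noisy_emb` rather than the clean text embedding, so that it sees points in a neighborhood of each text embedding where the corresponding image embeddings may lie.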