Understanding multi-label images remains a challenging task in computer vision. With the rapid progress of vision-language multimodal learning, vision-language models (VLMs) enable zero-shot recognition without labeled data. However, by design, these models tend to prioritize the most iconic object in an image and omit other contextual positives. This bias conflicts with the nature of multi-label learning and limits their applicability. In this work, we propose an unsupervised framework that adapts VLMs from iconic recognition toward inclusive understanding, enabling label-free multi-label image recognition. Our approach consists of two stages, ``cutting'' and ``sewing''. In the cutting stage, we present a multi-sampling response estimator that prevents the model from concentrating on a single object. In the sewing stage, we introduce multi-object blend adaptation, which adjusts the labels to better conform to the multi-label distribution while preserving the intrinsic characteristics of the original model, all within a single epoch. Extensive experiments on four public datasets show that our framework significantly outperforms existing unsupervised approaches and even surpasses several representative weakly supervised baselines. These results demonstrate the potential of adapting pre-trained VLMs for more comprehensive visual understanding without manual annotations. Our code is publicly available at https://github.com/iCVTEAM/TailorCLIP.
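To make the iconic bias concrete, the following is a minimal illustrative sketch, not the authors' estimator. It assumes that class scores are normalized with a softmax (so classes compete for probability mass) and that "multi-sampling" means scoring several image regions separately; both are assumptions, since the abstract does not specify either. The similarity values, the class names, and the function `aggregate_crop_responses` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of similarity scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_crop_responses(crop_scores):
    """Hypothetical aggregation: softmax each crop, then take the
    per-class maximum over crops (an assumption, not the paper's method)."""
    per_crop = [softmax(scores) for scores in crop_scores]
    num_classes = len(crop_scores[0])
    return [max(p[c] for p in per_crop) for c in range(num_classes)]


# Made-up image-text similarities for classes [dog, frisbee, grass].
global_scores = [9.0, 6.0, 5.0]   # whole image: the dog dominates
crop_scores = [
    [9.0, 3.0, 4.0],              # crop centered on the dog
    [3.0, 8.0, 4.0],              # crop centered on the frisbee
    [2.0, 3.0, 7.0],              # crop centered on the grass
]

# Softmax on the whole image leaves little mass for secondary labels.
global_probs = softmax(global_scores)

# Scoring crops separately lets each present object win somewhere.
multi_label = aggregate_crop_responses(crop_scores)

print("global :", [round(p, 3) for p in global_probs])
print("crops  :", [round(p, 3) for p in multi_label])
```

Under these assumptions, the whole-image softmax assigns almost all probability to the dominant class, whereas the per-crop aggregation gives every present class a high response. The per-class maximum is only one possible aggregation rule and is used here for simplicity.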