In this study, we investigate the task of data pre-selection, which aims to select instances for labeling from an unlabeled dataset in a single pass, thereby optimizing performance on undefined downstream tasks under a limited annotation budget. Previous approaches to data pre-selection relied solely on visual features extracted from foundation models, such as CLIP and BLIP-2, and largely ignored the power of text features. We argue that, with proper design, the joint feature space of vision and text can yield a better representation for data pre-selection. To this end, we introduce UP-DP, a simple yet effective unsupervised prompt learning approach that adapts vision-language models, such as BLIP-2, for data pre-selection. Specifically, with the BLIP-2 parameters frozen, we train text prompts to extract joint features with improved representation, ensuring a diverse cluster structure that covers the entire dataset. We extensively compare our method with state-of-the-art methods on seven benchmark datasets in different settings, achieving a performance gain of up to 20%. Interestingly, the prompts learned from one dataset generalize well and can be applied directly to enhance BLIP-2 feature extraction on other datasets. To the best of our knowledge, UP-DP is the first work to incorporate unsupervised prompt learning in a vision-language model for data pre-selection.
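To make the idea of selecting from a "diverse cluster structure that covers the entire dataset" concrete, the following is a minimal sketch of cluster-based pre-selection over feature vectors that have already been extracted. It is not the UP-DP method: the abstract does not specify how instances are picked from clusters, so this sketch assumes plain k-means with one cluster per unit of annotation budget and labels the sample nearest each centroid. The function names `sq_dist` and `preselect`, and the representation of features as lists of floats, are hypothetical choices made for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def preselect(features, budget, n_iter=20, seed=0):
    """Return indices of samples to label, at most one per k-means cluster.

    Assumption: `features` holds precomputed feature vectors (e.g. joint
    vision-text features); plain k-means stands in for whatever clustering
    the actual method uses.
    """
    n = len(features)
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the number of samples")

    rng = random.Random(seed)
    # Initialise centroids from distinct randomly chosen samples.
    centroids = [list(features[i]) for i in rng.sample(range(n), budget)]

    for _ in range(n_iter):
        # Assignment step: attach every sample to its nearest centroid.
        clusters = [[] for _ in range(budget)]
        for f in features:
            k = min(range(budget), key=lambda c: sq_dist(f, centroids[c]))
            clusters[k].append(f)
        # Update step: move each centroid to the mean of its members.
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]

    # Selection step: the sample closest to each centroid is sent for labeling.
    # Duplicates are merged, so fewer than `budget` indices may be returned.
    selected = {
        min(range(n), key=lambda i: sq_dist(features[i], c)) for c in centroids
    }
    return sorted(selected)
```

With two well-separated groups of points and a budget of two, this sketch returns one index from each group, which is the coverage property the abstract emphasizes: the labeled subset spans the dataset rather than concentrating in one region.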