Active learning (AL) aims to enhance model performance by selectively collecting highly informative data, thereby minimizing annotation costs. In practical scenarios, however, the unlabeled pool may contain out-of-distribution (OOD) samples, and annotation budget is wasted whenever such samples are selected. Recent research has explored applying AL to open-set data, but existing methods either require OOD samples for training or still incur unavoidable cost losses while trying to minimize them. To address these challenges, we propose a novel selection strategy, CLIPN for AL (CLIPNAL), which minimizes cost losses without requiring OOD samples. CLIPNAL evaluates the purity and the informativeness of data sequentially. First, it uses a pre-trained vision-language model to detect and exclude OOD data by leveraging the linguistic and visual information of in-distribution (ID) data, without additional training. Second, it selects highly informative samples from the remaining ID data, which are then annotated by human experts. Experimental results on datasets under various open-set conditions demonstrate that CLIPNAL achieves the lowest cost loss and the highest performance across all scenarios. Code is available at https://github.com/DSBA-Lab/OpenAL.
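The two-stage selection described in the abstract (a purity check that discards likely-OOD samples, followed by an informativeness ranking over the remaining ID candidates) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the ID probabilities are assumed to come from a CLIPN-style zero-shot detector, predictive entropy stands in for whichever informativeness measure CLIPNAL actually uses, and all names (`select_queries`, the toy pool) are hypothetical.

```python
import math

def select_queries(samples, purity_threshold=0.5, budget=2):
    """Two-stage open-set AL selection sketch (hypothetical scores).

    samples: list of (sample_id, id_probability, class_probs)
      - id_probability: probability the sample is in-distribution,
        assumed to come from a CLIPN-style zero-shot OOD detector
      - class_probs: the task model's predictive distribution over ID classes
    """
    # Stage 1 (purity): exclude samples the detector deems likely OOD,
    # so no annotation budget is spent on them.
    id_candidates = [s for s in samples if s[1] >= purity_threshold]

    # Stage 2 (informativeness): rank remaining ID candidates by
    # predictive entropy -- higher entropy means the model is less
    # certain, so annotating the sample is more informative.
    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    id_candidates.sort(key=lambda s: entropy(s[2]), reverse=True)
    return [s[0] for s in id_candidates[:budget]]

# Toy unlabeled pool: (name, P(ID), class distribution)
pool = [
    ("a", 0.9, [0.5, 0.5]),    # ID, maximally uncertain -> selected
    ("b", 0.2, [0.5, 0.5]),    # uncertain but likely OOD -> excluded
    ("c", 0.8, [0.99, 0.01]),  # ID but confident -> low priority
    ("d", 0.7, [0.6, 0.4]),    # ID, fairly uncertain -> selected
]
print(select_queries(pool))  # -> ['a', 'd']
```

Note the ordering of the two stages: sample "b" would top a purely uncertainty-based ranking, but filtering on purity first prevents its annotation cost from being wasted, which is the failure mode the abstract attributes to naive AL on open-set data.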