Active learning (AL) reduces annotation cost by selecting the most informative samples to label. Existing AL methods typically work under the closed-set assumption, i.e., every class present in the unlabeled sample pool must be classified by the target model. However, in some practical clinical tasks, the unlabeled pool may contain not only the target classes that require fine-grained classification, but also non-target classes that are irrelevant to the clinical task. Existing AL methods do not work well in this scenario because they tend to select a large number of non-target samples. In this paper, we formulate this scenario as an open-set AL problem and propose an efficient framework, OpenAL, to address the challenge of querying samples from an unlabeled pool that contains both target-class and non-target-class samples. Experiments on fine-grained classification of pathology images show that OpenAL significantly improves the query quality of target-class samples and achieves higher performance than current state-of-the-art AL methods. Code is available at https://github.com/miccaiif/OpenAL.