Category discovery methods aim to find novel categories in unlabeled visual data. At training time, a set of labeled and unlabeled images is provided, where the labels correspond to the categories present in the images. The labeled data provides guidance during training by indicating which visual properties and features are relevant for performing discovery in the unlabeled data. As a result, changing the categories present in the labeled set can have a large impact on what is ultimately discovered in the unlabeled set. Despite its importance, the impact of labeled data selection has not been explored in the category discovery literature to date. We show that changing the labeled data can significantly impact discovery performance. Motivated by this, we propose two new approaches for automatically selecting the most suitable labeled data based on the similarity between the labeled and unlabeled data. Our observation is that, unlike in conventional supervised transfer learning, the best labeled data is neither too similar nor too dissimilar to the unlabeled categories. Our resulting approaches obtain state-of-the-art discovery performance across a range of challenging fine-grained benchmark datasets.