Deep learning in general domains has been steadily extended to domain-specific tasks that require recognizing fine-grained characteristics. However, real-world applications of fine-grained tasks face two challenges: a heavy reliance on expert knowledge for annotation, and the need for a versatile model that serves various downstream tasks in a specific domain (e.g., prediction of categories, bounding boxes, or pixel-wise annotations). Fortunately, recent self-supervised learning (SSL) is a promising approach to pretraining a model without annotations, and the pretrained model serves as an effective initialization for any downstream task. Since SSL does not rely on annotations, it generally utilizes a large-scale unlabeled dataset, referred to as an open-set. In this sense, we introduce a novel Open-Set Self-Supervised Learning problem under the assumption that a large-scale unlabeled open-set, as well as the fine-grained target dataset, is available during the pretraining phase. In our problem setup, it is crucial to consider the distribution mismatch between the open-set and the target dataset. Hence, we propose the SimCore algorithm to sample a coreset, the subset of the open-set that has the minimum distance to the target dataset in the latent space. We demonstrate that SimCore significantly improves representation learning performance through extensive experimental settings, including eleven fine-grained datasets and seven open-sets in various downstream tasks.
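To make the coreset idea concrete, the following is a minimal sketch that assumes latent embeddings have already been computed for both datasets. The abstract does not specify the selection procedure, so this sketch swaps in a plain nearest-neighbor rule: each open-set sample is scored by its highest cosine similarity to any target sample, and the top-scoring samples are kept. The names `cosine_similarity`, `select_coreset`, and `budget` are hypothetical and are not taken from SimCore.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_coreset(open_set: List[Vector], target: List[Vector], budget: int) -> List[int]:
    """Return sorted indices of up to `budget` open-set embeddings nearest to the target set.

    Assumption: "nearest" means highest cosine similarity to the closest
    target embedding. This is a simplification for illustration only.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if not target:
        raise ValueError("target must contain at least one embedding")

    scored = []
    for idx, candidate in enumerate(open_set):
        # Score each open-set sample by its best match in the target dataset.
        best = max(cosine_similarity(candidate, t) for t in target)
        scored.append((best, idx))

    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(idx for _, idx in scored[:budget])


# Toy 2-D embeddings: the target points lie near the positive x-axis.
target = [[1.0, 0.0], [0.9, 0.1]]
open_set = [[1.0, 0.05], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]
print(select_coreset(open_set, target, budget=2))
```

In this toy example the two open-set points that lie near the positive x-axis are selected, while the orthogonal and opposite points are left out. A fixed `budget` is only one possible stopping rule; it is used here because it keeps the sketch short.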