In many machine learning applications, labeling datasets is an arduous and time-consuming task. Although research in computer vision has shown that semi-supervised learning techniques can achieve high accuracy with very few labels, little attention has been given to how the images to be labeled should be selected from a dataset. In this paper, we propose a novel approach, based on well-established self-supervised learning, clustering, and manifold learning techniques, that addresses the challenge of selecting an informative image subset to label in the first instance. This is known as the cold-start or unsupervised selective labeling problem. We test our approach on several publicly available datasets, namely CIFAR10, Imagenette, DeepWeeds, and EuroSAT, and observe improved performance over random sampling with both supervised and semi-supervised learning strategies when our label selection strategy is used. On the datasets considered, our approach also outperforms other methods in the literature while being much simpler.
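To make the general idea of cold-start label selection concrete, the following is a minimal sketch and not the method proposed in this paper. It assumes that image embeddings from some self-supervised model are already available (random synthetic vectors stand in for them here), uses t-SNE as one possible manifold learning step and k-means as one possible clustering step, and picks the sample closest to each cluster centroid. The function name `select_images_to_label` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE


def select_images_to_label(embeddings, budget, seed=0):
    """Return `budget` sample indices, one per cluster (illustrative only)."""
    # Manifold learning step (assumption: t-SNE to 2 dimensions).
    reduced = TSNE(
        n_components=2, perplexity=30.0, init="pca", random_state=seed
    ).fit_transform(embeddings)

    # Clustering step (assumption: k-means with one cluster per label slot).
    kmeans = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(reduced)

    selected = []
    for cluster_id in range(budget):
        members = np.flatnonzero(kmeans.labels_ == cluster_id)
        # Distance of each cluster member to its centroid in the reduced space.
        dists = np.linalg.norm(
            reduced[members] - kmeans.cluster_centers_[cluster_id], axis=1
        )
        # Keep the member closest to the centroid as the cluster representative.
        selected.append(int(members[np.argmin(dists)]))
    return selected


# Synthetic stand-in for self-supervised embeddings: 3 groups of 60 vectors.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 16))
embeddings = np.vstack([c + rng.normal(size=(60, 16)) for c in centers])

chosen = select_images_to_label(embeddings, budget=3)
print(chosen)  # indices of the images that would be sent for labeling
```

In this sketch the labeling budget equals the number of clusters, so each selected image represents a different region of the embedding space, in contrast to random sampling, which may draw several images from the same region.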